This repository has been archived by the owner on Jan 23, 2023. It is now read-only.
/ corefx Public archive

Base64 encoding with simd-support #34529

Merged
merged 23 commits into from
May 29, 2019

gfoidl
Member

@gfoidl gfoidl commented Jan 10, 2019

Description

Fixes https://github.com/dotnet/corefx/issues/32365

The code is based on and inspired by the C code from https://github.com/aklomp/base64, which is licensed under the BSD 2-Clause "Simplified" License.
The articles "Base64 encoding with SIMD instructions" and "Base64 decoding with SIMD instructions" give an outline of the algorithm, as it is not very intuitive.

I kept the variables, names, etc. as close as possible to the original code.

A version for Convert.ToBase64String is done in dotnet/coreclr#21833

Benchmarks

As mentioned in https://github.com/dotnet/corefx/issues/32365#issuecomment-443420296, I've created a separate package for base64 encoding / decoding (the main motivation was base64url support, and playing with the intrinsics). The code here is more or less an adaptation of that code for corefx, but in essence it is the same (at least after JITing, the "work" code is the same).
Therefore I'll show the perf numbers based on that code.

The benchmarks were run with sizes 5 (mini -- testing overhead), 16 (e.g. a Guid) and 1000.
HardwareIntrinsicsCustomConfig is used to run the benchmarks with AVX2, SSSE3 and pure scalar.

Summary of results

I'll give a brief summary of the results, as the table is quite large.

For encoding, speedups from 10% to 1000% and more are reported (the longer the input, the greater the speedup). In the scalar case this is mainly due to the elimination of movsxd from the loop.

Decoding doesn't show speedups as large as encoding, because the input has to be checked for valid base64 characters, but speedups of 500% are still shown.
The scalar case shows a regression of 10-20%. For me this is OK, as input-sizes of 5 seem pretty uncommon.

Encode

Benchmark

BenchmarkDotNet=v0.11.3, OS=Windows 10.0.14393.2485 (1607/AnniversaryUpdate/Redstone1), VM=Hyper-V
Intel Xeon CPU E5-2673 v3 2.40GHz, 1 CPU, 2 logical and 2 physical cores
.NET Core SDK=3.0.100-preview-009844
  [Host] : .NET Core 3.0.0-preview-27218-01 (CoreCLR 4.6.27217.02, CoreFX 4.7.18.61304), 64bit RyuJIT
  AVX2   : .NET Core 3.0.0-preview-27218-01 (CoreCLR 4.6.27217.02, CoreFX 4.7.18.61304), 64bit RyuJIT
  SSSE3  : .NET Core 3.0.0-preview-27218-01 (CoreCLR 4.6.27217.02, CoreFX 4.7.18.61304), 64bit RyuJIT
  Scalar : .NET Core 3.0.0-preview-27218-01 (CoreCLR 4.6.27217.02, CoreFX 4.7.18.61304), 64bit RyuJIT

Runtime=Core  
| Method | Job | EnvironmentVariables | DataLen | Mean | Error | StdDev | Ratio | RatioSD |
|---|---|---|---:|---:|---:|---:|---:|---:|
| BuffersBase64 | AVX2 | Empty | 5 | 25.91 ns | 0.5388 ns | 0.6617 ns | 1.00 | 0.00 |
| gfoidlBase64 | AVX2 | Empty | 5 | 22.87 ns | 0.3431 ns | 0.3210 ns | 0.89 | 0.02 |
| BuffersBase64 | SSSE3 | COMPlus_EnableAVX2=0 | 5 | 25.50 ns | 0.3564 ns | 0.3334 ns | 1.00 | 0.00 |
| gfoidlBase64 | SSSE3 | COMPlus_EnableAVX2=0 | 5 | 23.70 ns | 0.5091 ns | 0.5000 ns | 0.93 | 0.03 |
| BuffersBase64 | Scalar | COMPlus_EnableAVX2=0,COMPlus_EnableSSSE3=0 | 5 | 25.66 ns | 0.1901 ns | 0.1587 ns | 1.00 | 0.00 |
| gfoidlBase64 | Scalar | COMPlus_EnableAVX2=0,COMPlus_EnableSSSE3=0 | 5 | 21.54 ns | 0.4772 ns | 0.4901 ns | 0.84 | 0.02 |
| BuffersBase64 | AVX2 | Empty | 16 | 41.95 ns | 0.3670 ns | 0.3433 ns | 1.00 | 0.00 |
| gfoidlBase64 | AVX2 | Empty | 16 | 25.33 ns | 0.5545 ns | 0.7773 ns | 0.61 | 0.02 |
| BuffersBase64 | SSSE3 | COMPlus_EnableAVX2=0 | 16 | 42.30 ns | 0.8697 ns | 1.0016 ns | 1.00 | 0.00 |
| gfoidlBase64 | SSSE3 | COMPlus_EnableAVX2=0 | 16 | 25.91 ns | 0.4458 ns | 0.4170 ns | 0.61 | 0.02 |
| BuffersBase64 | Scalar | COMPlus_EnableAVX2=0,COMPlus_EnableSSSE3=0 | 16 | 40.91 ns | 0.7998 ns | 0.7482 ns | 1.00 | 0.00 |
| gfoidlBase64 | Scalar | COMPlus_EnableAVX2=0,COMPlus_EnableSSSE3=0 | 16 | 32.20 ns | 0.7008 ns | 1.7451 ns | 0.84 | 0.07 |
| BuffersBase64 | AVX2 | Empty | 1000 | 1,444.22 ns | 23.2487 ns | 21.7468 ns | 1.00 | 0.00 |
| gfoidlBase64 | AVX2 | Empty | 1000 | 132.03 ns | 0.6599 ns | 0.6173 ns | 0.09 | 0.00 |
| BuffersBase64 | SSSE3 | COMPlus_EnableAVX2=0 | 1000 | 1,454.78 ns | 16.5361 ns | 15.4679 ns | 1.00 | 0.00 |
| gfoidlBase64 | SSSE3 | COMPlus_EnableAVX2=0 | 1000 | 211.23 ns | 3.3349 ns | 3.1194 ns | 0.15 | 0.00 |
| BuffersBase64 | Scalar | COMPlus_EnableAVX2=0,COMPlus_EnableSSSE3=0 | 1000 | 1,432.28 ns | 26.4097 ns | 24.7036 ns | 1.00 | 0.00 |
| gfoidlBase64 | Scalar | COMPlus_EnableAVX2=0,COMPlus_EnableSSSE3=0 | 1000 | 1,323.54 ns | 21.6584 ns | 20.2593 ns | 0.92 | 0.02 |

Decode

Benchmark

BenchmarkDotNet=v0.11.3, OS=Windows 10.0.14393.2485 (1607/AnniversaryUpdate/Redstone1), VM=Hyper-V
Intel Xeon CPU E5-2673 v3 2.40GHz, 1 CPU, 2 logical and 2 physical cores
.NET Core SDK=3.0.100-preview-009844
  [Host] : .NET Core 3.0.0-preview-27218-01 (CoreCLR 4.6.27217.02, CoreFX 4.7.18.61304), 64bit RyuJIT
  AVX2   : .NET Core 3.0.0-preview-27218-01 (CoreCLR 4.6.27217.02, CoreFX 4.7.18.61304), 64bit RyuJIT
  SSSE3  : .NET Core 3.0.0-preview-27218-01 (CoreCLR 4.6.27217.02, CoreFX 4.7.18.61304), 64bit RyuJIT
  Scalar : .NET Core 3.0.0-preview-27218-01 (CoreCLR 4.6.27217.02, CoreFX 4.7.18.61304), 64bit RyuJIT

Runtime=Core  
| Method | Job | EnvironmentVariables | DataLen | Mean | Error | StdDev | Ratio | RatioSD |
|---|---|---|---:|---:|---:|---:|---:|---:|
| BuffersBase64 | AVX2 | Empty | 5 | 27.17 ns | 0.5911 ns | 0.5805 ns | 1.00 | 0.00 |
| gfoidlBase64 | AVX2 | Empty | 5 | 30.47 ns | 0.4374 ns | 0.3652 ns | 1.12 | 0.03 |
| BuffersBase64 | SSSE3 | COMPlus_EnableAVX2=0 | 5 | 26.61 ns | 0.4777 ns | 0.3989 ns | 1.00 | 0.00 |
| gfoidlBase64 | SSSE3 | COMPlus_EnableAVX2=0 | 5 | 31.74 ns | 0.6826 ns | 1.0425 ns | 1.20 | 0.04 |
| BuffersBase64 | Scalar | COMPlus_EnableAVX2=0,COMPlus_EnableSSSE3=0 | 5 | 26.37 ns | 0.4735 ns | 0.4429 ns | 1.00 | 0.00 |
| gfoidlBase64 | Scalar | COMPlus_EnableAVX2=0,COMPlus_EnableSSSE3=0 | 5 | 30.55 ns | 0.6446 ns | 0.7674 ns | 1.15 | 0.04 |
| BuffersBase64 | AVX2 | Empty | 16 | 41.76 ns | 0.4361 ns | 0.4079 ns | 1.00 | 0.00 |
| gfoidlBase64 | AVX2 | Empty | 16 | 38.24 ns | 0.2959 ns | 0.2768 ns | 0.92 | 0.01 |
| BuffersBase64 | SSSE3 | COMPlus_EnableAVX2=0 | 16 | 41.56 ns | 0.8188 ns | 0.7659 ns | 1.00 | 0.00 |
| gfoidlBase64 | SSSE3 | COMPlus_EnableAVX2=0 | 16 | 37.43 ns | 0.7654 ns | 0.7517 ns | 0.90 | 0.03 |
| BuffersBase64 | Scalar | COMPlus_EnableAVX2=0,COMPlus_EnableSSSE3=0 | 16 | 40.10 ns | 0.4917 ns | 0.4600 ns | 1.00 | 0.00 |
| gfoidlBase64 | Scalar | COMPlus_EnableAVX2=0,COMPlus_EnableSSSE3=0 | 16 | 44.10 ns | 0.7569 ns | 0.6709 ns | 1.10 | 0.02 |
| BuffersBase64 | AVX2 | Empty | 1000 | 1,252.14 ns | 19.1331 ns | 17.8971 ns | 1.00 | 0.00 |
| gfoidlBase64 | AVX2 | Empty | 1000 | 196.22 ns | 1.0869 ns | 1.0167 ns | 0.16 | 0.00 |
| BuffersBase64 | SSSE3 | COMPlus_EnableAVX2=0 | 1000 | 1,255.62 ns | 24.2244 ns | 24.8766 ns | 1.00 | 0.00 |
| gfoidlBase64 | SSSE3 | COMPlus_EnableAVX2=0 | 1000 | 269.54 ns | 4.0633 ns | 3.8008 ns | 0.21 | 0.01 |
| BuffersBase64 | Scalar | COMPlus_EnableAVX2=0,COMPlus_EnableSSSE3=0 | 1000 | 1,260.94 ns | 17.9500 ns | 16.7905 ns | 1.00 | 0.00 |
| gfoidlBase64 | Scalar | COMPlus_EnableAVX2=0,COMPlus_EnableSSSE3=0 | 1000 | 1,309.93 ns | 25.9879 ns | 29.9277 ns | 1.04 | 0.03 |

Notes

Alignment isn't considered in this code (and I'm not aware of a base64 implementation that considers alignment).

For encoding, the writes could be cache-aligned, as four bytes or multiples of four bytes are always written (and 64 % 4 = 0). But

  • it complicates the code quite a lot
  • on hardware that supports SSSE3 / AVX2 the difference should be minimal, if any
  • the scalar code-path may get quite long, so the benefit of vectorization is less

For decoding it is similar, except that four bytes are always read, so the reads could be aligned.

@gfoidl
Member Author

gfoidl commented Jan 10, 2019

As the basis for this code is licensed under the BSD 2-Clause "Simplified" License:

  • can this code be used here?
  • is there any special attribution needed?

@danmoseley
Member

I haven't looked at this yet but it sounds like we might need a TPN file in this folder, like eg https://github.com/dotnet/corefx/blob/master/src/System.Private.Xml/tests/Xslt/TestFiles/TestData/THIRD-PARTY-NOTICES

Member

@GrabYourPitchforks GrabYourPitchforks left a comment

I didn't review the implementation's correctness. This was just a quick skim looking for reliability issues.

do
{
AssertRead<Vector256<sbyte>>(ref src, ref srcStart, sourceLength);
Vector256<sbyte> str = Unsafe.As<byte, Vector256<sbyte>>(ref src);
Member

Only use Unsafe.As if this is known to be aligned. Use Unsafe.LoadUnaligned otherwise.

Member Author

Here it should not matter, as on x86 unaligned read/writes are emitted, and this section of the code is only executed if SSSE3 or AVX2 is available.

Anyway, I changed it, to make it more obvious that unaligned movs are used.

Member Author

Oops, I missed a point: https://github.com/dotnet/coreclr/issues/21132
(I forgot to copy over the comment for this.) So Unsafe.As is the faster option here.

Member Author

BTW: Would it be "safer" if Unsafe.As emitted unaligned reads / writes?
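
To make the two load styles from this thread concrete, here is a minimal sketch rather than the PR's code. It assumes the API the reviewer has in mind is `Unsafe.ReadUnaligned` from System.Runtime.CompilerServices; `UnalignedLoadDemo` and its method names are hypothetical. Both helpers read 16 bytes starting at an arbitrary offset. As noted in the thread, on x86 both are expected to compile to unaligned moves, but only the second one states that intent in the source.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

public static class UnalignedLoadDemo
{
    // Reinterprets the byte reference as a vector reference and dereferences it.
    // The dereference carries no "unaligned" hint in the IL.
    public static Vector128<byte> LoadViaAs(ReadOnlySpan<byte> data, int offset)
    {
        CheckRange(data.Length, offset);
        ref byte src = ref Unsafe.Add(ref MemoryMarshal.GetReference(data), offset);
        return Unsafe.As<byte, Vector128<byte>>(ref src);
    }

    // Same read, but explicitly marked as possibly unaligned.
    public static Vector128<byte> LoadViaReadUnaligned(ReadOnlySpan<byte> data, int offset)
    {
        CheckRange(data.Length, offset);
        ref byte src = ref Unsafe.Add(ref MemoryMarshal.GetReference(data), offset);
        return Unsafe.ReadUnaligned<Vector128<byte>>(ref src);
    }

    // Hypothetical guard: 16 bytes must be available starting at offset.
    private static void CheckRange(int length, int offset)
    {
        if ((uint)offset > (uint)length || length - offset < 16)
            throw new ArgumentOutOfRangeException(nameof(offset));
    }
}
```

Calling both helpers with the same odd offset on the same buffer should yield equal vectors; the difference is only in what the code promises about alignment.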

{
int vectorElements = Unsafe.SizeOf<TVector>();
ref byte readEnd = ref Unsafe.Add(ref src, vectorElements);
ref byte srcEnd = ref Unsafe.Add(ref srcStart, srcLength + 1);
Member

The GC might not be able to track this ref. Since you're checking for error conditions, I recommend using raw pointers rather than GC-tracked refs, rewriting this method in terms of the 'fixed' statement. Also check for boundary conditions like integer overflows.

Member

(Same comment for AssertWrite.)

int sourceIndex = 0;
int destIndex = 0;
// max. 2 padding chars
if (destLength + 2 < decodedLength)
Member

destLength + 2 could overflow and result in a negative value being compared. This doesn't appear to be handled properly later in the method.

Member Author

I fixed this -- thanks.

Even without the fix this should not be a problem, because of the existing comment in the code:

// This should never overflow since destLength here is less than int.MaxValue / 4 * 3 (i.e. 1610612733)

Still, with the fix it is "more correct" and safer anyway. 👍
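
The overflow the reviewer points at is easy to reproduce in isolation. The sketch below is illustrative only: `TooSmallNaive` mirrors the original comparison, and `TooSmallSafe` is one possible rewrite that moves the constant to the other side, assuming `decodedLength` is never negative. The names are hypothetical, and the fix actually committed in the PR may look different.

```csharp
using System;

public static class OverflowDemo
{
    // Mirrors the original check. For destLength close to int.MaxValue
    // the addition wraps around to a negative number.
    public static bool TooSmallNaive(int destLength, int decodedLength)
        => unchecked(destLength + 2) < decodedLength;

    // Subtracting from decodedLength cannot overflow as long as
    // decodedLength >= 0, so the comparison stays meaningful.
    public static bool TooSmallSafe(int destLength, int decodedLength)
        => destLength < decodedLength - 2;
}
```

With `destLength = int.MaxValue` and `decodedLength = 10`, the naive form reports the destination as too small, while the rewritten form does not. For ordinary lengths both forms agree.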

str = Avx2.PermuteVar8x32(@out, permuteVec).AsSByte();

AssertWrite<Vector256<sbyte>>(ref destBytes, ref destStart, destLength);
Unsafe.As<byte, Vector256<sbyte>>(ref destBytes) = str;
Member

Same comment re: Unsafe.As here.

Member

It actually shouldn't matter, we will only emit an unaligned move here and the above conditions will ensure this only happens on x86 (where hardware support exists).

{
ref byte srcStart = ref src;
ref byte destStart = ref destBytes;
ref byte simdSrcEnd = ref Unsafe.Add(ref src, (IntPtr)((uint)sourceLength - 45 + 1));
Member

When calling Unsafe.Add, it's legal to create a ref that points to just past the end of the buffer. For example, assume I have a Span<byte> span of length 5. Consider the following.

ref byte a = ref MemoryMarshal.GetReference(span); // &span[0], valid ref
ref byte b = ref Unsafe.Add(ref a, 5); // &span[5], valid ref
ref byte c = ref Unsafe.Add(ref b, 1); // &span[6], *invalid* ref

In the above example, both a and b are valid GC-tracked refs. Since b points to memory outside the buffer, it must not be dereferenced. But since b points just beyond the end of the buffer, the GC can still track it, so operations like Unsafe.IsAddressLessThan(a, b) will still work as expected.

c, on the other hand, is further than just beyond the end of the buffer, so the GC cannot track it. If the underlying object moves in memory, the GC is guaranteed to keep a and b in sync, but it makes no such guarantees for c. Therefore comparing a and c (or b and c) against each other results in undefined behavior.

The reason I mention this is that we need to be careful when we're creating refs that might be beyond the bounds of the buffer. I don't know from this particular call site if the call is valid from a GC-tracking perspective, so I wanted to draw your attention to it so that you can verify it.

Member

(This same comment applies to other instances of Unsafe.Add.)
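
As an illustration of the rule described above, the following sketch walks a span using a ref that points one element past the end as the loop bound. It is not taken from the PR; `OnePastEndDemo.Sum` is a hypothetical helper. The end ref is only ever compared, never dereferenced, and no ref further out than one past the end is created.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

public static class OnePastEndDemo
{
    public static int Sum(ReadOnlySpan<byte> span)
    {
        ref byte cur = ref MemoryMarshal.GetReference(span);

        // One past the last element: valid for comparisons, must not be read.
        ref byte end = ref Unsafe.Add(ref cur, span.Length);

        int sum = 0;
        while (Unsafe.IsAddressLessThan(ref cur, ref end))
        {
            sum += cur;                        // cur is always inside the buffer here
            cur = ref Unsafe.Add(ref cur, 1);  // advances at most to 'end'
        }
        return sum;
    }
}
```

A SIMD loop that advances by a larger stride needs the same care: the bound has to be chosen so that neither the bound itself nor the advancing ref ends up more than one element past the buffer.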

Member Author

Perfect explanation -- thank you!

It is always within the bounds of the buffer.
In this case sourceLength is guaranteed to be >= 45, so simdSrcEnd is within the buffer.
The stride is 32, so it is also within the buffer.

The other places have the same guarantees, as for encoding the min length is >= register size and the stride is register size / 4 * 3, so less than the register size.
For decoding the stride is the register size, and the min length is register size + max two padding + zeros that are written (see comments here and here), so 24 (SSSE3) or 45 (AVX2).

@jkotas
Member

jkotas commented Jan 11, 2019

> I haven't looked at this yet but it sounds like we might need a TPN file in this folder, like eg https://github.com/dotnet/corefx/blob/master/src/System.Private.Xml/tests/Xslt/TestFiles/TestData/THIRD-PARTY-NOTICES

These are tests. The shipping bits need to be handled in a different way. https://github.com/dotnet/coreclr/blob/master/Documentation/project-docs/contributing.md#copying-files-from-other-projects describes the proper way to do it.

private static unsafe void AssertRead<TVector>(ref byte src, ref byte srcStart, int srcLength)
{
fixed (byte* pSrc = &src)
fixed (byte* pSrcStart = &srcStart)
Member

I think you can just use Unsafe.IsAddressGreaterThan and not need to pin

Member Author

This was changed in 7115c84 because of #34529 (comment)

Personally I prefer the Unsafe variant (it just needs to be fixed so that it is GC-tracked correctly).

Shall I revert this part?

Member

I would recommend pinning once at the entry of the public method, and using regular pointers throughout the rest of the code.

The tricky byref arithmetic is error prone. It is not worth it to use it here. It is worth using it only in the lowest level methods where the few extra instructions that fixed compiles into show up.
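
The recommendation above (pin once at the public entry point, then use plain pointers) might look like this in outline. This is a sketch under assumptions: `PinOnceDemo`, `CopyBytes` and `CopyCore` are hypothetical names, the worker simply copies bytes instead of doing any Base64 work, and the project has to allow unsafe code (`AllowUnsafeBlocks`).

```csharp
using System;

public static class PinOnceDemo
{
    // Public entry point: pin both buffers once, then hand raw pointers
    // to the worker so that no byref arithmetic is needed further down.
    public static unsafe int CopyBytes(ReadOnlySpan<byte> source, Span<byte> destination)
    {
        int count = Math.Min(source.Length, destination.Length);

        fixed (byte* pSrc = source)
        fixed (byte* pDest = destination)
        {
            return CopyCore(pSrc, pDest, count);
        }
    }

    // Worker operating on plain pointers; bounds are the caller's responsibility.
    private static unsafe int CopyCore(byte* src, byte* dest, int count)
    {
        byte* srcEnd = src + count;
        while (src < srcEnd)
        {
            *dest++ = *src++;
        }
        return count;
    }
}
```

The trade-off discussed in the thread is visible here: the pointer loop is straightforward to reason about, at the cost of the extra instructions that each `fixed` statement adds to the method's prolog.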

Member Author

So change all the code (even the existing scalar path) to use raw pointers?

Member

I think so.

Member

Have you looked where it is losing the cycles? It is more than what I would expect.

Member Author

@gfoidl gfoidl Jan 15, 2019

The dasm for the "work" is nearly identical.
For pinning a bit more code is generated, but the slowdown is more than expected and greater than in other places. Maybe it comes from code and loop alignment?

Is there anything that can be tested?
I made several attempts and tweaks to get better results; this is the best I got so far.

The dasm is in https://github.com/gfoidl/Benchmarks/tree/6a15e45d702663e2fd9bdb3cfc1d80794c576e49/corefx/System/Buffers/Text/Base64Benchmarks/results/dasm

Member Author

rep stosd in the prolog looks suspicious to me. https://github.com/dotnet/coreclr/issues/13827 may be similar, but I have to admit that I'm missing knowledge in this area.

Member

Some of the overhead comes from the checks for empty Spans in Span pinning.

You can try the direct Span pinning, like fixed (byte* srcBytes = &MemoryMarshal.GetReference(utf8)) to see whether it makes a difference.

Also, the encodingMap can stay on the byref plan for now. Eventually, it should be changed to a pre-initialized ReadOnlySpan, but that can be done as a separate change.
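
For readers unfamiliar with the "pre-initialized ReadOnlySpan" idea, here is a minimal sketch; it is not the code from the PR, and `EncodingMapDemo` and `Lookup` are hypothetical names. When a property returns `new byte[] { constants }` as a `ReadOnlySpan<byte>`, the C# compiler can emit a span over constant data embedded in the assembly instead of allocating an array on each access.

```csharp
using System;

public static class EncodingMapDemo
{
    // The standard Base64 alphabet as constant byte data.
    private static ReadOnlySpan<byte> EncodingMap => new byte[]
    {
        (byte)'A', (byte)'B', (byte)'C', (byte)'D', (byte)'E', (byte)'F', (byte)'G', (byte)'H',
        (byte)'I', (byte)'J', (byte)'K', (byte)'L', (byte)'M', (byte)'N', (byte)'O', (byte)'P',
        (byte)'Q', (byte)'R', (byte)'S', (byte)'T', (byte)'U', (byte)'V', (byte)'W', (byte)'X',
        (byte)'Y', (byte)'Z', (byte)'a', (byte)'b', (byte)'c', (byte)'d', (byte)'e', (byte)'f',
        (byte)'g', (byte)'h', (byte)'i', (byte)'j', (byte)'k', (byte)'l', (byte)'m', (byte)'n',
        (byte)'o', (byte)'p', (byte)'q', (byte)'r', (byte)'s', (byte)'t', (byte)'u', (byte)'v',
        (byte)'w', (byte)'x', (byte)'y', (byte)'z', (byte)'0', (byte)'1', (byte)'2', (byte)'3',
        (byte)'4', (byte)'5', (byte)'6', (byte)'7', (byte)'8', (byte)'9', (byte)'+', (byte)'/'
    };

    // Maps a 6-bit value to its Base64 character (as an ASCII byte).
    public static byte Lookup(int sixBits) => EncodingMap[sixBits & 0x3F];
}
```

Whether this is a win inside a hot loop depends on the generated code, for example on whether the JIT keeps the address of the data in a register across iterations.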

Member Author

@gfoidl gfoidl Jan 16, 2019

> Some of the overhead comes from the checks for empty Spans in Span pinning.

This overhead is negligible and more or less within noise (see results for encoding, decoding).

With ee47609 (latest commit) we get equal results (encoding, decoding) to the pure byref-version (before 0d2cbc4) (and due to the tweaks, even an improvement for large data).

In the dasm the difference is in the prolog. Excerpt of the diff shown for encoding, decoding is similar:

G_M39726_IG01:
        push     rbp
        push     r15
        push     r14
-       push     r13
        push     r12
        push     rbx
-       sub      rsp, 56
+       sub      rsp, 48
        vzeroupper
-       lea      rbp, [rsp+60H]
-       mov      r12, rcx
-       mov      r13, rdi
-       lea      rdi, [rbp-60H]
-       mov      ecx, 6
+       lea      rbp, [rsp+50H]
        xor      rax, rax
-       rep stosd
-       mov      rcx, r12
-       mov      rdi, r13
-       mov      bword ptr [rbp-38H], rdi
-       mov      qword ptr [rbp-30H], rsi
-       mov      bword ptr [rbp-48H], rdx
-       mov      qword ptr [rbp-40H], rcx
+       mov      qword ptr [rbp-48H], rax
+       mov      qword ptr [rbp-50H], rax
+       mov      bword ptr [rbp-30H], rdi
+       mov      qword ptr [rbp-28H], rsi
+       mov      bword ptr [rbp-40H], rdx
+       mov      qword ptr [rbp-38H], rcx
        mov      rbx, r8
        mov      r14, r9
 
G_M39726_IG02:
-       cmp      dword ptr [rbp-30H], 0
-       ja       SHORT G_M39726_IG04
+       cmp      dword ptr [rbp-28H], 0
+       ja       SHORT G_M39753_IG04
        xor      eax, eax
        mov      dword ptr [rbx], eax
        mov      dword ptr [r14], eax
-; Total bytes of code 1265, prolog size 69 for method Base64:EncodeToUtf8(struct,struct,byref,byref,bool):int
+; Total bytes of code 1209, prolog size 52 for method Base64:EncodeToUtf8(struct,struct,byref,byref,bool):int
 ; ============================================================

Full dasm-diffs: encoding, decoding

Maybe rep stosd really was the cause of this slowdown. A brief enquiry showed that it has quite a high startup overhead (~35 cycles) [1] and is sensitive to alignment [2]. Further info in [3].
Aside: it seems that rep stosd is emitted when there are three fixed statements (I didn't investigate deeply, just checked with a simple test).

> pre-initialized ReadOnlySpan

I tried this for encoding and decoding (it also works for sbyte).
There is some strange codegen with unnecessary stack spills, redundant loading (here it is displayed as 0xD1FFAB1E, actually it is the same address), and multiple calls to CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE.

ROS as local was also tried, with the same result as ROS as property.


if (utf8.Length == 0)
goto DoneExit;
if (Avx2.IsSupported && maxSrcLength >= 45)
Member

Is the length here taking into account other potential drawbacks from executing 256-bit instructions?

Some examples are:

  • For unaligned data, reads/writes crossing a cache-line boundary will happen twice as frequently (same with crossing page boundaries, for large enough data)
  • Additional saving/restoring of the upper 128-bits across method call boundaries
  • Additional vzeroupper calls
  • Possible frequency downscaling when executing a "heavy" 256-bit workload (this one tends to make micro-benches look good, but real world scenarios can actually regress)
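
For context, the dispatch being questioned can be sketched as follows. This is an illustration, not the PR's code: `DispatchDemo` and `CodePath` are hypothetical, and the thresholds 45 (AVX2) and 24 (SSSE3) are the minimum lengths mentioned elsewhere in this thread for decoding. They describe only the shortest input the vector loop can handle, which is exactly why the question is whether they are also the lengths at which 256-bit code pays off.

```csharp
using System.Runtime.Intrinsics.X86;

public static class DispatchDemo
{
    public enum CodePath { Scalar, Ssse3, Avx2 }

    // Pure decision logic, kept separate so it can be exercised on any machine.
    public static CodePath Choose(int sourceLength, bool avx2Supported, bool ssse3Supported)
    {
        if (avx2Supported && sourceLength >= 45) return CodePath.Avx2;
        if (ssse3Supported && sourceLength >= 24) return CodePath.Ssse3;
        return CodePath.Scalar;
    }

    // Uses the hardware capabilities reported by the runtime.
    public static CodePath ChooseForThisMachine(int sourceLength)
        => Choose(sourceLength, Avx2.IsSupported, Ssse3.IsSupported);
}
```

Raising the AVX2 threshold in such a scheme would be a one-line change, but picking a good value needs the kind of profiling described in the replies below.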

Member Author

The length here doesn't take any of your points / potential AVX drawbacks into account.

I thought about aligning writes for encoding / reads for decoding, as they are multiples of four, and so (in theory) this could be done. But it's not easy to do this without "eating" up quite a lot in a scalar way.

Do you have a suggestion what to do with the length / what it should take specifically into account?

Member

> Do you have a suggestion what to do with the length / what it should take specifically into account?

The first three can probably just be profiled multiple times with varying ranges of input data. The last one is really hard to determine outside of profiling real-world scenarios.

> But it's not easy to do this without "eating" up quite a lot in a scalar way.

Did you look at processing the leading/trailing elements via vectorization as well (which should give you, at most, 2 unaligned read/writes)? I think you might have mentioned trying this on the other thread...

Member Author

> processing the leading/trailing elements via vectorization

You're correct that I tried this, but w/o success, as the bookkeeping for this produces more overhead than the scalar processing.
Base64 is a bit tricky with vectorization, as not all elements of the vector produce usable results. Some elements are just 0, and will get overwritten in the next iteration / in the scalar remainder. Then there's the padding. Taking all this into account produces quite a lot of overhead -- or I'm not clever enough to come up with an easy and correct solution for this.

@gfoidl
Member Author

gfoidl commented Jan 14, 2019

Sorry, pushed 788245d from a WIP branch to do raw-pointers, but there is some regression and I'm investigating why. Reverted back to 5ace580

@karelz
Member

karelz commented Mar 4, 2019

@gfoidl what's the status of this PR? It seems to have been stuck for 1.5 months now unless I missed something.
What are the next steps?
BTW: If you don't have time to finish it now, that is ok - let's just close it for now and let's reopen it when you do have time. Thanks! (doing pre-spring stale PR cleanup ;))

@gfoidl
Member Author

gfoidl commented Mar 4, 2019

@karelz from my point of view it's ready for further review (except the merge conflict due to the comments -- I'll rebase it).

@GrabYourPitchforks can you have another look here?

@gfoidl
Member Author

gfoidl commented Mar 4, 2019

Rebased due to conflicts (from #35354)

src/System.Memory/src/System/Buffers/Text/Base64Decoder.cs
src/System.Memory/src/System/Buffers/Text/Base64Encoder.cs 

@danmoseley
Member

@GrabYourPitchforks could you please take another look so we can shepherd this to merging?

@gfoidl
Member Author

gfoidl commented Mar 11, 2019

In 925f7ed ROSpan is only used for static vector-data, not for the encoding/decoding maps.

https://github.com/gfoidl/corefx/commit/a3dbc990a6d42c6e8c3934be3144913e3067000f is the change for encoding/decoding maps, but (for me) the codegen is not ideal, as the ref to the static data isn't kept in a register.

G_M39788_IG13:
       0FB602               movzx    rax, byte  ptr [rdx]
       440FB64A01           movzx    r9, byte  ptr [rdx+1]
       440FB65202           movzx    r10, byte  ptr [rdx+2]
       C1E010               shl      eax, 16
       41C1E108             shl      r9d, 8
       410BC1               or       eax, r9d
       410BC2               or       eax, r10d
       448BC8               mov      r9d, eax
       41C1E912             shr      r9d, 18
-      460FB60C0F           movzx    r9, byte  ptr [rdi+r9]
+      49BEF71B0898847F0000 mov      r14, 0x7F8498081BF7
+      420FB61C33           movzx    r9, byte  ptr [r9+r14]
       448BD0               mov      r10d, eax
       41C1EA0C             shr      r10d, 12
       4183E23F             and      r10d, 63
-      460FB61417           movzx    r10, byte  ptr [rdi+r10]
+      49BFF71B0898847F0000 mov      r15, 0x7F8498081BF7
+      470FB6343E           movzx    r10, byte  ptr [r10+r15]
       448BD8               mov      r11d, eax
       41C1EB06             shr      r11d, 6
       4183E33F             and      r11d, 63
-      460FB61C1F           movzx    r11, byte  ptr [rdi+r11]
+      49BCF71B0898847F0000 mov      r12, 0x7F8498081BF7
+      470FB63C27           movzx    r11, byte  ptr [r11+r12]
       83E03F               and      eax, 63
+      0FB60407             movzx    rax, byte  ptr [rax+r12]
       41C1E208             shl      r10d, 8
       450BCA               or       r9d, r10d
       41C1E310             shl      r11d, 16
       450BCB               or       r9d, r11d
       C1E018               shl      eax, 24
       410BC1               or       eax, r9d
       8901                 mov      dword ptr [rcx], eax
       4883C203             add      rdx, 3
       4883C104             add      rcx, 4
       493BD0               cmp      rdx, r8
       7290                 jb       SHORT G_M39788_IG13

Is there any way to hint the JIT to keep the ref to the static data in a register,
the way rdi is used in the current asm?
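
For readers mapping the listing above back to source, the loop body corresponds roughly to a scalar encode step like the following. This is a hedged reconstruction from the disassembly, not the PR's source: `ScalarEncodeDemo` and `EncodeThreeBytes` are hypothetical names, and a constant string stands in for the encoding map. Three input bytes are merged into 24 bits, split into four 6-bit indices, looked up in the map, and packed into one 32-bit value that is written with a single store.

```csharp
public static class ScalarEncodeDemo
{
    // Stand-in for the encoding map: the standard Base64 alphabet.
    private const string Alphabet =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    public static uint EncodeThreeBytes(byte b0, byte b1, byte b2)
    {
        // shl 16 / shl 8 / or / or in the listing
        uint bits = ((uint)b0 << 16) | ((uint)b1 << 8) | b2;

        // Four table lookups; these are the movzx loads that differ in the diff.
        uint c0 = Alphabet[(int)(bits >> 18)];
        uint c1 = Alphabet[(int)((bits >> 12) & 0x3F)];
        uint c2 = Alphabet[(int)((bits >> 6) & 0x3F)];
        uint c3 = Alphabet[(int)(bits & 0x3F)];

        // Packed little-endian so that one 4-byte store writes c0 c1 c2 c3 in order.
        return c0 | (c1 << 8) | (c2 << 16) | (c3 << 24);
    }
}
```

Each of the four lookups needs the base address of the map, which is why reloading that address as an immediate before every lookup (the added `mov r14/r15/r12, imm64` lines) stands out in the diff.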

An advantage is that the method's setup code (i.e. outside the loops) gets shorter:
G_M39788_IG04:
-       488D7DD0             lea      rdi, bword ptr [rbp-30H]
-       4C8B3F               mov      r15, bword ptr [rdi]
-       4C897DB8             mov      bword ptr [rbp-48H], r15
-       488D7DC0             lea      rdi, bword ptr [rbp-40H]
-       4C8B27               mov      r12, bword ptr [rdi]
-       4C8965B0             mov      bword ptr [rbp-50H], r12
-       48BFA00E8741697F0000 mov      rdi, 0x7F6941870EA0
-       BE04000000           mov      esi, 4
-       E82504C078           call     CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE
-       48B8280B002C697F0000 mov      rax, 0x7F692C000B28
-       488B38               mov      rdi, gword ptr [rax]
-       837F0800             cmp      dword ptr [rdi+8], 0
-       0F86FE030000         jbe      G_M39788_IG25
-       4883C710             add      rdi, 16
-       8B75D8               mov      esi, dword ptr [rbp-28H]
-       8B4DC8               mov      ecx, dword ptr [rbp-38H]
-       81FEFDFFFF5F         cmp      esi, 0x5FFFFFFD
-       7F2C                 jg       SHORT G_M39788_IG06
-       81FEFDFFFF5F         cmp      esi, 0x5FFFFFFD
-       0F87D8030000         ja       G_M39788_IG24
+       488D45D0             lea      rax, bword ptr [rbp-30H]
+       488B38               mov      rdi, bword ptr [rax]
+       48897DB8             mov      bword ptr [rbp-48H], rdi
+       488D45C0             lea      rax, bword ptr [rbp-40H]
+       488B30               mov      rsi, bword ptr [rax]
+       488975B0             mov      bword ptr [rbp-50H], rsi
+       8B4DD8               mov      ecx, dword ptr [rbp-28H]
+       448B55C8             mov      r10d, dword ptr [rbp-38H]
+       81F9FDFFFF5F         cmp      ecx, 0x5FFFFFFD
+       7F2D                 jg       SHORT G_M39793_IG06
+       81F9FDFFFF5F         cmp      ecx, 0x5FFFFFFD
+       0F871E040000         ja       G_M39793_IG24

@benaadams
Member

> is the change for encoding/decoding maps, but (for me) the codegen is not ideal, as the ref to the static data isn't kept in a register.

Since you are in unsafe code anyway, why not pin the span with fixed and use a pointer?

Member

@tannergooding tannergooding left a comment

A few remaining nits, but overall LGTM.

@gfoidl
Member Author

gfoidl commented May 27, 2019

Benchmark-results didn't change from #34529 (comment)

dasm for encode
; Assembly listing for method Base64:EncodeToUtf8(struct,struct,byref,byref,bool):int
; Emitting BLENDED_CODE for X64 CPU with AVX - Unix
; optimized code
; rbp based frame
; fully interruptible
; Final local variable assignments
;
;  V00 arg0         [V00    ] (  5,  4   )  struct (16) [rbp-0x30]   do-not-enreg[XSFB] addr-exposed ld-addr-op
;  V01 arg1         [V01    ] (  4,  3   )  struct (16) [rbp-0x40]   do-not-enreg[XSFB] addr-exposed ld-addr-op
;  V02 arg2         [V02,T20] (  6,  4   )   byref  ->   r8
;  V03 arg3         [V03,T21] (  6,  4   )   byref  ->   r9
;  V04 arg4         [V04,T65] (  1,  0.50)    bool  ->  [rbp+0x10]
;  V05 loc0         [V05,T23] (  8,  4   )    long  ->  rdi
;  V06 loc1         [V06    ] (  1,  0.50)   byref  ->  [rbp-0x48]   must-init pinned
;  V07 loc2         [V07,T24] (  6,  3   )    long  ->  rsi
;  V08 loc3         [V08    ] (  1,  0.50)   byref  ->  [rbp-0x50]   must-init pinned
;  V09 loc4         [V09,T25] (  6,  3   )     int  ->  rcx
;  V10 loc5         [V10,T44] (  3,  1.50)     int  ->  r10
;  V11 loc6         [V11,T28] (  4,  2   )     int  ->  rax
;  V12 loc7         [V12,T00] ( 28, 35   )    long  ->  registers   ld-addr-op
;  V13 loc8         [V13,T02] ( 16, 18.50)    long  ->  r10         ld-addr-op
;  V14 loc9         [V14,T26] (  6,  3   )    long  ->  rcx
;  V15 loc10        [V15,T17] (  8,  7.50)    long  ->  r11
;* V16 loc11        [V16,T52] (  0,  0   )   byref  ->  zero-ref
;  V17 loc12        [V17,T09] (  6, 10   )     int  ->  rax
;  V18 loc13        [V18,T22] (  6,  6   )    long  ->  rax
;# V19 OutArgs      [V19    ] (  1,  1   )  lclBlk ( 0) [rsp+0x00]   "OutgoingArgSpace"
;* V20 tmp1         [V20    ] (  0,  0   )  struct (16) zero-ref    "struct address for call/obj"
;* V21 tmp2         [V21    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op "Inlining Arg"
;* V22 tmp3         [V22    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op "Inlining Arg"
;* V23 tmp4         [V23    ] (  0,  0   )  struct (16) zero-ref    "struct address for call/obj"
;  V24 tmp5         [V24,T76] (  2,  2.50)  simd32  ->  mm0         "Inline stloc first use temp"
;* V25 tmp6         [V25    ] (  0,  0   )  simd32  ->  zero-ref    "struct address for call/obj"
;  V26 tmp7         [V26,T77] (  2,  2.50)  simd32  ->  mm1         "Inline stloc first use temp"
;* V27 tmp8         [V27    ] (  0,  0   )  simd32  ->  zero-ref    "struct address for call/obj"
;  V28 tmp9         [V28,T78] (  2,  2.50)  simd32  ->  mm2         "Inline stloc first use temp"
;* V29 tmp10        [V29    ] (  0,  0   )  simd32  ->  zero-ref    "struct address for call/obj"
;  V30 tmp11        [V30,T79] (  2,  2.50)  simd32  ->  mm3         "Inline stloc first use temp"
;* V31 tmp12        [V31    ] (  0,  0   )  simd32  ->  zero-ref    "struct address for call/obj"
;  V32 tmp13        [V32,T80] (  2,  2.50)  simd32  ->  mm4         "Inline stloc first use temp"
;  V33 tmp14        [V33,T81] (  2,  2.50)  simd32  ->  mm5         "Inline stloc first use temp"
;  V34 tmp15        [V34,T82] (  2,  2.50)  simd32  ->  mm6         "Inline stloc first use temp"
;* V35 tmp16        [V35    ] (  0,  0   )  struct (16) zero-ref    "struct address for call/obj"
;  V36 tmp17        [V36,T83] (  2,  2.50)  simd32  ->  mm7         "Inline stloc first use temp"
;  V37 tmp18        [V37,T10] (  6,  9   )    long  ->  r10         "Inline stloc first use temp"
;  V38 tmp19        [V38,T18] (  5,  7   )    long  ->  rdx         "Inline stloc first use temp"
;  V39 tmp20        [V39,T66] ( 14, 23.50)  simd32  ->  mm8         "Inline stloc first use temp"
;* V40 tmp21        [V40    ] (  0,  0   )  struct (16) zero-ref    "struct address for call/obj"
;  V41 tmp22        [V41,T92] (  2,  2   )  simd32  ->  mm9         "struct address for call/obj"
;  V42 tmp23        [V42,T68] (  2,  4   )  simd32  ->  mm9         "Inline stloc first use temp"
;  V43 tmp24        [V43,T69] (  2,  4   )  simd32  ->  mm9         "Inline stloc first use temp"
;  V44 tmp25        [V44,T70] (  2,  4   )  simd32  ->  mm9         "Inline stloc first use temp"
;  V45 tmp26        [V45,T71] (  2,  4   )  simd32  ->  mm9         "Inline stloc first use temp"
;* V46 tmp27        [V46    ] (  0,  0   )  struct (16) zero-ref    "NewObj constructor temp"
;* V47 tmp28        [V47    ] (  0,  0   )  struct ( 8) zero-ref    "NewObj constructor temp"
;* V48 tmp29        [V48    ] (  0,  0   )  struct (16) zero-ref    "Inlining Arg"
;* V49 tmp30        [V49    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op "Inlining Arg"
;* V50 tmp31        [V50    ] (  0,  0   )   byref  ->  zero-ref    "Inlining Arg"
;  V51 tmp32        [V51,T93] (  2,  1   )  simd32  ->  mm1         "Inline return value spill temp"
;  V52 tmp33        [V52,T94] (  2,  1   )  simd16  ->  mm1         "Inline stloc first use temp"
;  V53 tmp34        [V53,T95] (  2,  1   )  simd32  ->  mm2         "Inline return value spill temp"
;  V54 tmp35        [V54,T96] (  2,  1   )  simd16  ->  mm2         "Inline stloc first use temp"
;  V55 tmp36        [V55,T97] (  2,  1   )  simd32  ->  mm3         "Inline return value spill temp"
;  V56 tmp37        [V56,T98] (  2,  1   )  simd16  ->  mm3         "Inline stloc first use temp"
;  V57 tmp38        [V57,T99] (  2,  1   )  simd32  ->  mm4         "Inline return value spill temp"
;  V58 tmp39        [V58,T100] (  2,  1   )  simd16  ->  mm4         "Inline stloc first use temp"
;  V59 tmp40        [V59,T101] (  2,  1   )  simd32  ->  mm5         "Inline return value spill temp"
;  V60 tmp41        [V60,T102] (  2,  1   )  simd16  ->  mm5         "Inline stloc first use temp"
;  V61 tmp42        [V61,T103] (  2,  1   )  simd32  ->  mm6         "Inline return value spill temp"
;  V62 tmp43        [V62,T104] (  2,  1   )  simd16  ->  mm6         "Inline stloc first use temp"
;* V63 tmp44        [V63    ] (  0,  0   )  struct (16) zero-ref    "NewObj constructor temp"
;* V64 tmp45        [V64    ] (  0,  0   )  struct ( 8) zero-ref    "NewObj constructor temp"
;* V65 tmp46        [V65    ] (  0,  0   )  struct (16) zero-ref    "Inlining Arg"
;* V66 tmp47        [V66    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op "Inlining Arg"
;* V67 tmp48        [V67    ] (  0,  0   )   byref  ->  zero-ref    "Inlining Arg"
;* V68 tmp49        [V68    ] (  0,  0   )  struct (16) zero-ref    "NewObj constructor temp"
;* V69 tmp50        [V69    ] (  0,  0   )  struct ( 8) zero-ref    "NewObj constructor temp"
;* V70 tmp51        [V70    ] (  0,  0   )  struct (16) zero-ref    "Inlining Arg"
;* V71 tmp52        [V71    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op "Inlining Arg"
;* V72 tmp53        [V72    ] (  0,  0   )   byref  ->  zero-ref    "Inlining Arg"
;* V73 tmp54        [V73    ] (  0,  0   )  struct (16) zero-ref    "struct address for call/obj"
;  V74 tmp55        [V74,T84] (  2,  2.50)  simd16  ->  mm0         "Inline stloc first use temp"
;* V75 tmp56        [V75    ] (  0,  0   )  simd16  ->  zero-ref    "struct address for call/obj"
;  V76 tmp57        [V76,T85] (  2,  2.50)  simd16  ->  mm1         "Inline stloc first use temp"
;* V77 tmp58        [V77    ] (  0,  0   )  simd16  ->  zero-ref    "struct address for call/obj"
;  V78 tmp59        [V78,T86] (  2,  2.50)  simd16  ->  mm2         "Inline stloc first use temp"
;* V79 tmp60        [V79    ] (  0,  0   )  simd16  ->  zero-ref    "struct address for call/obj"
;  V80 tmp61        [V80,T87] (  2,  2.50)  simd16  ->  mm3         "Inline stloc first use temp"
;* V81 tmp62        [V81    ] (  0,  0   )  simd16  ->  zero-ref    "struct address for call/obj"
;  V82 tmp63        [V82,T88] (  2,  2.50)  simd16  ->  mm4         "Inline stloc first use temp"
;  V83 tmp64        [V83,T89] (  2,  2.50)  simd16  ->  mm5         "Inline stloc first use temp"
;  V84 tmp65        [V84,T90] (  2,  2.50)  simd16  ->  mm6         "Inline stloc first use temp"
;* V85 tmp66        [V85    ] (  0,  0   )  struct (16) zero-ref    "struct address for call/obj"
;  V86 tmp67        [V86,T91] (  2,  2.50)  simd16  ->  mm7         "Inline stloc first use temp"
;  V87 tmp68        [V87,T11] (  6,  9   )    long  ->  rdx         "Inline stloc first use temp"
;  V88 tmp69        [V88,T19] (  5,  7   )    long  ->  r10         "Inline stloc first use temp"
;  V89 tmp70        [V89,T67] ( 11, 22   )  simd16  ->  mm8         "Inline stloc first use temp"
;  V90 tmp71        [V90,T72] (  2,  4   )  simd16  ->  mm9         "Inline stloc first use temp"
;  V91 tmp72        [V91,T73] (  2,  4   )  simd16  ->  mm9         "Inline stloc first use temp"
;  V92 tmp73        [V92,T74] (  2,  4   )  simd16  ->  mm9         "Inline stloc first use temp"
;  V93 tmp74        [V93,T75] (  2,  4   )  simd16  ->  mm9         "Inline stloc first use temp"
;* V94 tmp75        [V94    ] (  0,  0   )  struct (16) zero-ref    "NewObj constructor temp"
;* V95 tmp76        [V95    ] (  0,  0   )  struct ( 8) zero-ref    "NewObj constructor temp"
;* V96 tmp77        [V96    ] (  0,  0   )  struct (16) zero-ref    "Inlining Arg"
;* V97 tmp78        [V97    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op "Inlining Arg"
;* V98 tmp79        [V98    ] (  0,  0   )   byref  ->  zero-ref    "Inlining Arg"
;  V99 tmp80        [V99,T105] (  2,  1   )  simd16  ->  mm1         "Inline return value spill temp"
;  V100 tmp81       [V100,T106] (  2,  1   )  simd16  ->  mm1         "Inline stloc first use temp"
;  V101 tmp82       [V101,T107] (  2,  1   )  simd16  ->  mm2         "Inline return value spill temp"
;  V102 tmp83       [V102,T108] (  2,  1   )  simd16  ->  mm2         "Inline stloc first use temp"
;  V103 tmp84       [V103,T109] (  2,  1   )  simd16  ->  mm3         "Inline return value spill temp"
;  V104 tmp85       [V104,T110] (  2,  1   )  simd16  ->  mm3         "Inline stloc first use temp"
;  V105 tmp86       [V105,T111] (  2,  1   )  simd16  ->  mm4         "Inline return value spill temp"
;  V106 tmp87       [V106,T112] (  2,  1   )  simd16  ->  mm4         "Inline stloc first use temp"
;  V107 tmp88       [V107,T113] (  2,  1   )  simd16  ->  mm5         "Inline return value spill temp"
;  V108 tmp89       [V108,T114] (  2,  1   )  simd16  ->  mm5         "Inline stloc first use temp"
;  V109 tmp90       [V109,T115] (  2,  1   )  simd16  ->  mm6         "Inline return value spill temp"
;  V110 tmp91       [V110,T116] (  2,  1   )  simd16  ->  mm6         "Inline stloc first use temp"
;* V111 tmp92       [V111    ] (  0,  0   )  struct (16) zero-ref    "NewObj constructor temp"
;* V112 tmp93       [V112    ] (  0,  0   )  struct ( 8) zero-ref    "NewObj constructor temp"
;* V113 tmp94       [V113    ] (  0,  0   )  struct (16) zero-ref    "Inlining Arg"
;* V114 tmp95       [V114    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op "Inlining Arg"
;* V115 tmp96       [V115    ] (  0,  0   )   byref  ->  zero-ref    "Inlining Arg"
;* V116 tmp97       [V116    ] (  0,  0   )  struct (16) zero-ref    "NewObj constructor temp"
;* V117 tmp98       [V117    ] (  0,  0   )  struct ( 8) zero-ref    "NewObj constructor temp"
;* V118 tmp99       [V118    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op "Inlining Arg"
;* V119 tmp100      [V119    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;  V120 tmp101      [V120,T12] (  2,  8   )     int  ->  rbx         "Inline stloc first use temp"
;  V121 tmp102      [V121,T03] (  2, 16   )     int  ->  rax         "impAppendStmt"
;  V122 tmp103      [V122,T13] (  2,  8   )     int  ->  r14         "Inline stloc first use temp"
;  V123 tmp104      [V123,T01] (  5, 20   )     int  ->  rax         "Inline stloc first use temp"
;  V124 tmp105      [V124,T04] (  2, 16   )     int  ->  rbx         "impAppendStmt"
;  V125 tmp106      [V125,T14] (  2,  8   )     int  ->  r14         "Inline stloc first use temp"
;  V126 tmp107      [V126,T15] (  2,  8   )     int  ->  r15         "Inline stloc first use temp"
;  V127 tmp108      [V127,T16] (  2,  8   )     int  ->  rax         "Inline stloc first use temp"
;* V128 tmp109      [V128    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;  V129 tmp110      [V129,T05] (  2, 16   )    long  ->  rbx         "NewObj constructor temp"
;* V130 tmp111      [V130    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;* V131 tmp112      [V131    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;  V132 tmp113      [V132,T06] (  2, 16   )    long  ->  r14         "NewObj constructor temp"
;* V133 tmp114      [V133    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;* V134 tmp115      [V134    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;  V135 tmp116      [V135,T07] (  2, 16   )    long  ->  r15         "NewObj constructor temp"
;* V136 tmp117      [V136    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;* V137 tmp118      [V137    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;  V138 tmp119      [V138,T08] (  2, 16   )    long  ->  rax         "NewObj constructor temp"
;* V139 tmp120      [V139    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;* V140 tmp121      [V140    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;* V141 tmp122      [V141    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;  V142 tmp123      [V142,T45] (  3,  1.50)     int  ->  rcx         "Inline stloc first use temp"
;  V143 tmp124      [V143,T32] (  2,  2   )     int  ->  rax         "impAppendStmt"
;  V144 tmp125      [V144,T48] (  2,  1   )     int  ->  rcx         "Inline stloc first use temp"
;* V145 tmp126      [V145    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;  V146 tmp127      [V146,T33] (  2,  2   )    long  ->  rax         "NewObj constructor temp"
;* V147 tmp128      [V147    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;* V148 tmp129      [V148    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;  V149 tmp130      [V149,T34] (  2,  2   )    long  ->  rcx         "NewObj constructor temp"
;* V150 tmp131      [V150    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;* V151 tmp132      [V151    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;* V152 tmp133      [V152    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;  V153 tmp134      [V153,T49] (  2,  1   )     int  ->  rcx         "Inline stloc first use temp"
;  V154 tmp135      [V154,T35] (  2,  2   )     int  ->  rax         "impAppendStmt"
;  V155 tmp136      [V155,T29] (  4,  2   )     int  ->  rax         "Inline stloc first use temp"
;  V156 tmp137      [V156,T36] (  2,  2   )     int  ->  rcx         "impAppendStmt"
;  V157 tmp138      [V157,T50] (  2,  1   )     int  ->  r11         "Inline stloc first use temp"
;  V158 tmp139      [V158,T51] (  2,  1   )     int  ->  rax         "Inline stloc first use temp"
;* V159 tmp140      [V159    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;  V160 tmp141      [V160,T37] (  2,  2   )    long  ->  rcx         "NewObj constructor temp"
;* V161 tmp142      [V161    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;* V162 tmp143      [V162    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;  V163 tmp144      [V163,T38] (  2,  2   )    long  ->  r11         "NewObj constructor temp"
;* V164 tmp145      [V164    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;* V165 tmp146      [V165    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;  V166 tmp147      [V166,T39] (  2,  2   )    long  ->  rax         "NewObj constructor temp"
;* V167 tmp148      [V167    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;* V168 tmp149      [V168    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;* V169 tmp150      [V169,T46] (  0,  0   )   byref  ->  zero-ref    V20._pointer(offs=0x00) P-INDEP "field V20._pointer (fldOffset=0x0)"
;* V170 tmp151      [V170    ] (  0,  0   )     int  ->  zero-ref    V20._length(offs=0x08) P-INDEP "field V20._length (fldOffset=0x8)"
;  V171 tmp152      [V171,T42] (  3,  1.50)   byref  ->  rdi         V21._pointer(offs=0x00) P-INDEP "field V21._pointer (fldOffset=0x0)"
;* V172 tmp153      [V172    ] (  0,  0   )     int  ->  zero-ref    V21._length(offs=0x08) P-INDEP "field V21._length (fldOffset=0x8)"
;  V173 tmp154      [V173,T43] (  3,  1.50)   byref  ->  rsi         V22._pointer(offs=0x00) P-INDEP "field V22._pointer (fldOffset=0x0)"
;* V174 tmp155      [V174    ] (  0,  0   )     int  ->  zero-ref    V22._length(offs=0x08) P-INDEP "field V22._length (fldOffset=0x8)"
;* V175 tmp156      [V175    ] (  0,  0   )   byref  ->  zero-ref    V23._pointer(offs=0x00) P-INDEP "field V23._pointer (fldOffset=0x0)"
;* V176 tmp157      [V176    ] (  0,  0   )     int  ->  zero-ref    V23._length(offs=0x08) P-INDEP "field V23._length (fldOffset=0x8)"
;* V177 tmp158      [V177    ] (  0,  0   )   byref  ->  zero-ref    V35._pointer(offs=0x00) P-INDEP "field V35._pointer (fldOffset=0x0)"
;* V178 tmp159      [V178    ] (  0,  0   )     int  ->  zero-ref    V35._length(offs=0x08) P-INDEP "field V35._length (fldOffset=0x8)"
;* V179 tmp160      [V179    ] (  0,  0   )   byref  ->  zero-ref    V40._pointer(offs=0x00) P-INDEP "field V40._pointer (fldOffset=0x0)"
;* V180 tmp161      [V180    ] (  0,  0   )     int  ->  zero-ref    V40._length(offs=0x08) P-INDEP "field V40._length (fldOffset=0x8)"
;* V181 tmp162      [V181,T53] (  0,  0   )   byref  ->  zero-ref    V46._pointer(offs=0x00) P-INDEP "field V46._pointer (fldOffset=0x0)"
;* V182 tmp163      [V182    ] (  0,  0   )     int  ->  zero-ref    V46._length(offs=0x08) P-INDEP "field V46._length (fldOffset=0x8)"
;* V183 tmp164      [V183,T54] (  0,  0   )   byref  ->  zero-ref    V47._value(offs=0x00) P-INDEP "field V47._value (fldOffset=0x0)"
;* V184 tmp165      [V184    ] (  0,  0   )   byref  ->  zero-ref    V48._pointer(offs=0x00) P-INDEP "field V48._pointer (fldOffset=0x0)"
;* V185 tmp166      [V185    ] (  0,  0   )     int  ->  zero-ref    V48._length(offs=0x08) P-INDEP "field V48._length (fldOffset=0x8)"
;* V186 tmp167      [V186    ] (  0,  0   )   byref  ->  zero-ref    V49._pointer(offs=0x00) P-INDEP "field V49._pointer (fldOffset=0x0)"
;* V187 tmp168      [V187    ] (  0,  0   )     int  ->  zero-ref    V49._length(offs=0x08) P-INDEP "field V49._length (fldOffset=0x8)"
;* V188 tmp169      [V188,T55] (  0,  0   )   byref  ->  zero-ref    V63._pointer(offs=0x00) P-INDEP "field V63._pointer (fldOffset=0x0)"
;* V189 tmp170      [V189    ] (  0,  0   )     int  ->  zero-ref    V63._length(offs=0x08) P-INDEP "field V63._length (fldOffset=0x8)"
;* V190 tmp171      [V190,T56] (  0,  0   )   byref  ->  zero-ref    V64._value(offs=0x00) P-INDEP "field V64._value (fldOffset=0x0)"
;* V191 tmp172      [V191    ] (  0,  0   )   byref  ->  zero-ref    V65._pointer(offs=0x00) P-INDEP "field V65._pointer (fldOffset=0x0)"
;* V192 tmp173      [V192    ] (  0,  0   )     int  ->  zero-ref    V65._length(offs=0x08) P-INDEP "field V65._length (fldOffset=0x8)"
;* V193 tmp174      [V193    ] (  0,  0   )   byref  ->  zero-ref    V66._pointer(offs=0x00) P-INDEP "field V66._pointer (fldOffset=0x0)"
;* V194 tmp175      [V194    ] (  0,  0   )     int  ->  zero-ref    V66._length(offs=0x08) P-INDEP "field V66._length (fldOffset=0x8)"
;* V195 tmp176      [V195,T57] (  0,  0   )   byref  ->  zero-ref    V68._pointer(offs=0x00) P-INDEP "field V68._pointer (fldOffset=0x0)"
;* V196 tmp177      [V196    ] (  0,  0   )     int  ->  zero-ref    V68._length(offs=0x08) P-INDEP "field V68._length (fldOffset=0x8)"
;* V197 tmp178      [V197,T58] (  0,  0   )   byref  ->  zero-ref    V69._value(offs=0x00) P-INDEP "field V69._value (fldOffset=0x0)"
;* V198 tmp179      [V198    ] (  0,  0   )   byref  ->  zero-ref    V70._pointer(offs=0x00) P-INDEP "field V70._pointer (fldOffset=0x0)"
;* V199 tmp180      [V199    ] (  0,  0   )     int  ->  zero-ref    V70._length(offs=0x08) P-INDEP "field V70._length (fldOffset=0x8)"
;* V200 tmp181      [V200    ] (  0,  0   )   byref  ->  zero-ref    V71._pointer(offs=0x00) P-INDEP "field V71._pointer (fldOffset=0x0)"
;* V201 tmp182      [V201    ] (  0,  0   )     int  ->  zero-ref    V71._length(offs=0x08) P-INDEP "field V71._length (fldOffset=0x8)"
;* V202 tmp183      [V202    ] (  0,  0   )   byref  ->  zero-ref    V73._pointer(offs=0x00) P-INDEP "field V73._pointer (fldOffset=0x0)"
;* V203 tmp184      [V203    ] (  0,  0   )     int  ->  zero-ref    V73._length(offs=0x08) P-INDEP "field V73._length (fldOffset=0x8)"
;* V204 tmp185      [V204    ] (  0,  0   )   byref  ->  zero-ref    V85._pointer(offs=0x00) P-INDEP "field V85._pointer (fldOffset=0x0)"
;* V205 tmp186      [V205    ] (  0,  0   )     int  ->  zero-ref    V85._length(offs=0x08) P-INDEP "field V85._length (fldOffset=0x8)"
;* V206 tmp187      [V206,T59] (  0,  0   )   byref  ->  zero-ref    V94._pointer(offs=0x00) P-INDEP "field V94._pointer (fldOffset=0x0)"
;* V207 tmp188      [V207    ] (  0,  0   )     int  ->  zero-ref    V94._length(offs=0x08) P-INDEP "field V94._length (fldOffset=0x8)"
;* V208 tmp189      [V208,T60] (  0,  0   )   byref  ->  zero-ref    V95._value(offs=0x00) P-INDEP "field V95._value (fldOffset=0x0)"
;* V209 tmp190      [V209    ] (  0,  0   )   byref  ->  zero-ref    V96._pointer(offs=0x00) P-INDEP "field V96._pointer (fldOffset=0x0)"
;* V210 tmp191      [V210    ] (  0,  0   )     int  ->  zero-ref    V96._length(offs=0x08) P-INDEP "field V96._length (fldOffset=0x8)"
;* V211 tmp192      [V211    ] (  0,  0   )   byref  ->  zero-ref    V97._pointer(offs=0x00) P-INDEP "field V97._pointer (fldOffset=0x0)"
;* V212 tmp193      [V212    ] (  0,  0   )     int  ->  zero-ref    V97._length(offs=0x08) P-INDEP "field V97._length (fldOffset=0x8)"
;* V213 tmp194      [V213,T61] (  0,  0   )   byref  ->  zero-ref    V111._pointer(offs=0x00) P-INDEP "field V111._pointer (fldOffset=0x0)"
;* V214 tmp195      [V214    ] (  0,  0   )     int  ->  zero-ref    V111._length(offs=0x08) P-INDEP "field V111._length (fldOffset=0x8)"
;* V215 tmp196      [V215,T62] (  0,  0   )   byref  ->  zero-ref    V112._value(offs=0x00) P-INDEP "field V112._value (fldOffset=0x0)"
;* V216 tmp197      [V216    ] (  0,  0   )   byref  ->  zero-ref    V113._pointer(offs=0x00) P-INDEP "field V113._pointer (fldOffset=0x0)"
;* V217 tmp198      [V217    ] (  0,  0   )     int  ->  zero-ref    V113._length(offs=0x08) P-INDEP "field V113._length (fldOffset=0x8)"
;* V218 tmp199      [V218    ] (  0,  0   )   byref  ->  zero-ref    V114._pointer(offs=0x00) P-INDEP "field V114._pointer (fldOffset=0x0)"
;* V219 tmp200      [V219    ] (  0,  0   )     int  ->  zero-ref    V114._length(offs=0x08) P-INDEP "field V114._length (fldOffset=0x8)"
;* V220 tmp201      [V220,T47] (  0,  0   )   byref  ->  zero-ref    V116._pointer(offs=0x00) P-INDEP "field V116._pointer (fldOffset=0x0)"
;* V221 tmp202      [V221    ] (  0,  0   )     int  ->  zero-ref    V116._length(offs=0x08) P-INDEP "field V116._length (fldOffset=0x8)"
;* V222 tmp203      [V222,T63] (  0,  0   )   byref  ->  zero-ref    V117._value(offs=0x00) P-INDEP "field V117._value (fldOffset=0x0)"
;* V223 tmp204      [V223,T64] (  0,  0   )   byref  ->  zero-ref    V118._pointer(offs=0x00) P-INDEP "field V118._pointer (fldOffset=0x0)"
;* V224 tmp205      [V224    ] (  0,  0   )     int  ->  zero-ref    V118._length(offs=0x08) P-INDEP "field V118._length (fldOffset=0x8)"
;  V225 tmp206      [V225,T30] (  2,  2   )   byref  ->  rax         "BlockOp address local"
;  V226 tmp207      [V226,T40] (  2,  2   )    long  ->  rdi         "Cast away GC"
;  V227 tmp208      [V227,T31] (  2,  2   )   byref  ->  rax         "BlockOp address local"
;  V228 tmp209      [V228,T41] (  2,  2   )    long  ->  rsi         "Cast away GC"
;  V229 rat0        [V229,T27] (  3,  3   )     int  ->  rdx         "ReplaceWithLclVar is creating a new local variable"
;
; Lcl frame size = 48

G_M39793_IG01:
       55                   push     rbp
       4157                 push     r15
       4156                 push     r14
       4154                 push     r12
       53                   push     rbx
       4883EC30             sub      rsp, 48
       C5F877               vzeroupper
       488D6C2450           lea      rbp, [rsp+50H]
       33C0                 xor      rax, rax
       488945B8             mov      qword ptr [rbp-48H], rax
       488945B0             mov      qword ptr [rbp-50H], rax
       48897DD0             mov      bword ptr [rbp-30H], rdi
       488975D8             mov      qword ptr [rbp-28H], rsi
       488955C0             mov      bword ptr [rbp-40H], rdx
       48894DC8             mov      qword ptr [rbp-38H], rcx

G_M39793_IG02:
       837DD800             cmp      dword ptr [rbp-28H], 0
       7718                 ja       SHORT G_M39793_IG04
       33C0                 xor      eax, eax
       418900               mov      dword ptr [r8], eax
       418901               mov      dword ptr [r9], eax

G_M39793_IG03:
       C5F877               vzeroupper
       488D65E0             lea      rsp, [rbp-20H]
       5B                   pop      rbx
       415C                 pop      r12
       415E                 pop      r14
       415F                 pop      r15
       5D                   pop      rbp
       C3                   ret

G_M39793_IG04:
       488D45D0             lea      rax, bword ptr [rbp-30H]
       488B38               mov      rdi, bword ptr [rax]
       48897DB8             mov      bword ptr [rbp-48H], rdi
       488D45C0             lea      rax, bword ptr [rbp-40H]
       488B30               mov      rsi, bword ptr [rax]
       488975B0             mov      bword ptr [rbp-50H], rsi
       8B4DD8               mov      ecx, dword ptr [rbp-28H]
       448B55C8             mov      r10d, dword ptr [rbp-38H]
       81F9FDFFFF5F         cmp      ecx, 0x5FFFFFFD
       7F2D                 jg       SHORT G_M39793_IG06
       81F9FDFFFF5F         cmp      ecx, 0x5FFFFFFD
       0F871E040000         ja       G_M39793_IG24

G_M39793_IG05:
       8D5102               lea      edx, [rcx+2]
       41BB56555555         mov      r11d, 0x55555556
       418BC3               mov      eax, r11d
       F7EA                 imul     edx:eax, edx
       8BC2                 mov      eax, edx
       C1E81F               shr      eax, 31
       03C2                 add      eax, edx
       C1E002               shl      eax, 2
       413BC2               cmp      eax, r10d
       7F04                 jg       SHORT G_M39793_IG06
       8BC1                 mov      eax, ecx
       EB08                 jmp      SHORT G_M39793_IG07

G_M39793_IG06:
       41C1FA02             sar      r10d, 2
       438D0452             lea      eax, [r10+2*r10]

G_M39793_IG07:
       488BD7               mov      rdx, rdi
       4C8BD6               mov      r10, rsi
       8BC9                 mov      ecx, ecx
       4803CA               add      rcx, rdx
       448BD8               mov      r11d, eax
       4C03DA               add      r11, rdx
       83F810               cmp      eax, 16
       0F8CF1010000         jl       G_M39793_IG12
       498D43E0             lea      rax, [r11-32]
       483BC7               cmp      rax, rdi
       0F8202010000         jb       G_M39793_IG10
       48BA67DD95C2D27F0000 mov      rdx, 0x7FD2C295DD67
       C5FD1002             vmovupd  ymm0, ymmword ptr[rdx]
       41BA00FCC00F         mov      r10d, 0xFC0FC00
       C4C1796ECA           vmovd    xmm1, r10d
       C4E27D58C9           vpbroadcastd ymm1, ymm1
       BAF0033F00           mov      edx, 0x3F03F0
       C5F96ED2             vmovd    xmm2, edx
       C4E27D58D2           vpbroadcastd ymm2, ymm2
       BA40000004           mov      edx, 0x4000040
       C5F96EDA             vmovd    xmm3, edx
       C4E27D58DB           vpbroadcastd ymm3, ymm3
       BA10000001           mov      edx, 0x1000010
       C5F96EE2             vmovd    xmm4, edx
       C4E27D58E4           vpbroadcastd ymm4, ymm4
       BA33000000           mov      edx, 51
       C5F96EEA             vmovd    xmm5, edx
       C4E27D78ED           vpbroadcastb ymm5, ymm5
       BA19000000           mov      edx, 25
       C5F96EF2             vmovd    xmm6, edx
       C4E27D78F6           vpbroadcastb ymm6, ymm6
       48BA87DE95C2D27F0000 mov      rdx, 0x7FD2C295DE87
       C5FD103A             vmovupd  ymm7, ymmword ptr[rdx]
       488BD6               mov      rdx, rsi
       C57E6F07             vmovdqu  ymm8, ymmword ptr[rdi]
       49BA0FDE95C2D27F0000 mov      r10, 0x7FD2C295DE0F
       C4417D100A           vmovupd  ymm9, ymmword ptr[r10]
       C4423536C0           vpermd   ymm8, ymm9, ymm8
       4C8D57FC             lea      r10, [rdi-4]

G_M39793_IG08:
       C4623D00C0           vpshufb  ymm8, ymm8, ymm0
       C53DDBCA             vpand    ymm9, ymm8, ymm2
       C535D5CC             vpmullw  ymm9, ymm9, ymm4
       C53DDBC1             vpand    ymm8, ymm8, ymm1
       C53DE4C3             vpmulhuw ymm8, ymm8, ymm3
       C4413DEBC1           vpor     ymm8, ymm8, ymm9
       C53D64CE             vpcmpgtb ymm9, ymm8, ymm6
       C53DD8D5             vpsubusb ymm10, ymm8, ymm5
       C4412DF8C9           vpsubb   ymm9, ymm10, ymm9
       C4424500C9           vpshufb  ymm9, ymm7, ymm9
       C4413DFCC1           vpaddb   ymm8, ymm8, ymm9
       C57E7F02             vmovdqu  ymmword ptr[rdx], ymm8
       4983C218             add      r10, 24
       4883C220             add      rdx, 32
       4C3BD0               cmp      r10, rax
       770E                 ja       SHORT G_M39793_IG09
       C4417E6F02           vmovdqu  ymm8, ymmword ptr[r10]
       EBB7                 jmp      SHORT G_M39793_IG08

G_M39793_IG09:
       498D4204             lea      rax, [r10+4]
       4C8BD2               mov      r10, rdx
       483BC1               cmp      rax, rcx
       0F845B020000         je       G_M39793_IG17
       488BD0               mov      rdx, rax

G_M39793_IG10:
       498D43F0             lea      rax, [r11-16]
       483BC2               cmp      rax, rdx
       0F82D5000000         jb       G_M39793_IG12
       48BBB7DD95C2D27F0000 mov      rbx, 0x7FD2C295DDB7
       C5F91003             vmovupd  xmm0, xmmword ptr [rbx]
       BB00FCC00F           mov      ebx, 0xFC0FC00
       C5F96ECB             vmovd    xmm1, ebx
       C4E27958C9           vpbroadcastd xmm1, xmm1
       BBF0033F00           mov      ebx, 0x3F03F0
       C5F96ED3             vmovd    xmm2, ebx
       C4E27958D2           vpbroadcastd xmm2, xmm2
       BB40000004           mov      ebx, 0x4000040
       C5F96EDB             vmovd    xmm3, ebx
       C4E27958DB           vpbroadcastd xmm3, xmm3
       BB10000001           mov      ebx, 0x1000010
       C5F96EE3             vmovd    xmm4, ebx
       C4E27958E4           vpbroadcastd xmm4, xmm4
       BB33000000           mov      ebx, 51
       C5F96EEB             vmovd    xmm5, ebx
       C4E27978ED           vpbroadcastb xmm5, xmm5
       BB19000000           mov      ebx, 25
       C5F96EF3             vmovd    xmm6, ebx
       C4E27978F6           vpbroadcastb xmm6, xmm6
       48BB3FDE95C2D27F0000 mov      rbx, 0x7FD2C295DE3F
       C5F9103B             vmovupd  xmm7, xmmword ptr [rbx]

G_M39793_IG11:
       C57A6F02             vmovdqu  xmm8, xmmword ptr [rdx]
       C4623900C0           vpshufb  xmm8, xmm8, xmm0
       C539DBCA             vpand    xmm9, xmm8, xmm2
       C531D5CC             vpmullw  xmm9, xmm9, xmm4
       C539DBC1             vpand    xmm8, xmm8, xmm1
       C539E4C3             vpmulhuw xmm8, xmm8, xmm3
       C44139EBC1           vpor     xmm8, xmm8, xmm9
       C53964CE             vpcmpgtb xmm9, xmm8, xmm6
       C539D8D5             vpsubusb xmm10, xmm8, xmm5
       C44129F8C9           vpsubb   xmm9, xmm10, xmm9
       C4424100C9           vpshufb  xmm9, xmm7, xmm9
       C44139FCC1           vpaddb   xmm8, xmm8, xmm9
       C4417A7F02           vmovdqu  xmmword ptr [r10], xmm8
       4883C20C             add      rdx, 12
       4983C210             add      r10, 16
       483BD0               cmp      rdx, rax
       76B9                 jbe      SHORT G_M39793_IG11
       483BD1               cmp      rdx, rcx
       0F8405010000         je       G_M39793_IG15

G_M39793_IG12:
       4983C3FE             add      r11, -2
       493BD3               cmp      rdx, r11
       0F838E000000         jae      G_M39793_IG14

G_M39793_IG13:
       0FB602               movzx    rax, byte  ptr [rdx]
       0FB65A01             movzx    rbx, byte  ptr [rdx+1]
       440FB67202           movzx    r14, byte  ptr [rdx+2]
       C1E010               shl      eax, 16
       C1E308               shl      ebx, 8
       0BC3                 or       eax, ebx
       410BC6               or       eax, r14d
       8BD8                 mov      ebx, eax
       C1EB12               shr      ebx, 18
       49BE27DC95C2D27F0000 mov      r14, 0x7FD2C295DC27
       420FB61C33           movzx    rbx, byte  ptr [rbx+r14]
       448BF0               mov      r14d, eax
       41C1EE0C             shr      r14d, 12
       4183E63F             and      r14d, 63
       49BF27DC95C2D27F0000 mov      r15, 0x7FD2C295DC27
       470FB6343E           movzx    r14, byte  ptr [r14+r15]
       448BF8               mov      r15d, eax
       41C1EF06             shr      r15d, 6
       4183E73F             and      r15d, 63
       49BC27DC95C2D27F0000 mov      r12, 0x7FD2C295DC27
       470FB63C27           movzx    r15, byte  ptr [r15+r12]
       83E03F               and      eax, 63
       420FB60420           movzx    rax, byte  ptr [rax+r12]
       41C1E608             shl      r14d, 8
       410BDE               or       ebx, r14d
       41C1E710             shl      r15d, 16
       410BDF               or       ebx, r15d
       C1E018               shl      eax, 24
       0BC3                 or       eax, ebx
       418902               mov      dword ptr [r10], eax
       4883C203             add      rdx, 3
       4983C204             add      r10, 4
       493BD3               cmp      rdx, r11
       0F8272FFFFFF         jb       G_M39793_IG13

G_M39793_IG14:
       498D4302             lea      rax, [r11+2]
       483BC1               cmp      rax, rcx
       0F85F4000000         jne      G_M39793_IG20
       807D1000             cmp      byte  ptr [rbp+10H], 0
       0F8411010000         je       G_M39793_IG22
       488D4201             lea      rax, [rdx+1]
       483BC1               cmp      rax, rcx
       7548                 jne      SHORT G_M39793_IG16
       0FB60A               movzx    rcx, byte  ptr [rdx]
       C1E108               shl      ecx, 8
       8BC1                 mov      eax, ecx
       C1E80A               shr      eax, 10
       49BB27DC95C2D27F0000 mov      r11, 0x7FD2C295DC27
       420FB60418           movzx    rax, byte  ptr [rax+r11]
       C1E904               shr      ecx, 4
       83E13F               and      ecx, 63
       420FB60C19           movzx    rcx, byte  ptr [rcx+r11]
       C1E108               shl      ecx, 8
       0BC1                 or       eax, ecx
       0D00003D00           or       eax, 0x3D0000
       0D0000003D           or       eax, 0x3D000000
       418902               mov      dword ptr [r10], eax
       48FFC2               inc      rdx
       4983C204             add      r10, 4
       488BC2               mov      rax, rdx
       EB78                 jmp      SHORT G_M39793_IG17

G_M39793_IG15:
       488BC2               mov      rax, rdx
       EB73                 jmp      SHORT G_M39793_IG17

G_M39793_IG16:
       488D4202             lea      rax, [rdx+2]
       483BC1               cmp      rax, rcx
       0F8587000000         jne      G_M39793_IG19
       0FB602               movzx    rax, byte  ptr [rdx]
       0FB64A01             movzx    rcx, byte  ptr [rdx+1]
       C1E010               shl      eax, 16
       C1E108               shl      ecx, 8
       0BC1                 or       eax, ecx
       8BC8                 mov      ecx, eax
       C1E912               shr      ecx, 18
       49BB27DC95C2D27F0000 mov      r11, 0x7FD2C295DC27
       420FB60C19           movzx    rcx, byte  ptr [rcx+r11]
       448BD8               mov      r11d, eax
       41C1EB0C             shr      r11d, 12
       4183E33F             and      r11d, 63
       48BB27DC95C2D27F0000 mov      rbx, 0x7FD2C295DC27
       450FB61C1B           movzx    r11, byte  ptr [r11+rbx]
       C1E806               shr      eax, 6
       83E03F               and      eax, 63
       0FB60418             movzx    rax, byte  ptr [rax+rbx]
       41C1E308             shl      r11d, 8
       410BCB               or       ecx, r11d
       C1E010               shl      eax, 16
       0BC1                 or       eax, ecx
       0D0000003D           or       eax, 0x3D000000
       418902               mov      dword ptr [r10], eax
       4883C202             add      rdx, 2
       4983C204             add      r10, 4
       488BC2               mov      rax, rdx

G_M39793_IG17:
       482BC7               sub      rax, rdi
       418900               mov      dword ptr [r8], eax
       498BC2               mov      rax, r10
       482BC6               sub      rax, rsi
       418901               mov      dword ptr [r9], eax
       33C0                 xor      eax, eax

G_M39793_IG18:
       C5F877               vzeroupper
       488D65E0             lea      rsp, [rbp-20H]
       5B                   pop      rbx
       415C                 pop      r12
       415E                 pop      r14
       415F                 pop      r15
       5D                   pop      rbp
       C3                   ret

G_M39793_IG19:
       488BC2               mov      rax, rdx
       EBDA                 jmp      SHORT G_M39793_IG17

G_M39793_IG20:
       488BC2               mov      rax, rdx
       482BC7               sub      rax, rdi
       418900               mov      dword ptr [r8], eax
       498BC2               mov      rax, r10
       482BC6               sub      rax, rsi
       418901               mov      dword ptr [r9], eax
       B801000000           mov      eax, 1

G_M39793_IG21:
       C5F877               vzeroupper
       488D65E0             lea      rsp, [rbp-20H]
       5B                   pop      rbx
       415C                 pop      r12
       415E                 pop      r14
       415F                 pop      r15
       5D                   pop      rbp
       C3                   ret

G_M39793_IG22:
       488BC2               mov      rax, rdx
       482BC7               sub      rax, rdi
       418900               mov      dword ptr [r8], eax
       498BC2               mov      rax, r10
       482BC6               sub      rax, rsi
       418901               mov      dword ptr [r9], eax
       B802000000           mov      eax, 2

G_M39793_IG23:
       C5F877               vzeroupper
       488D65E0             lea      rsp, [rbp-20H]
       5B                   pop      rbx
       415C                 pop      r12
       415E                 pop      r14
       415F                 pop      r15
       5D                   pop      rbp
       C3                   ret

G_M39793_IG24:
       33FF                 xor      edi, edi
       E8F0ECFFFF           call     ThrowHelper:ThrowArgumentOutOfRangeException(int)
       CC                   int3

; Total bytes of code 1145, prolog size 46 for method Base64:EncodeToUtf8(struct,struct,byref,byref,bool):int
; ============================================================
dasm for decode
; Assembly listing for method Base64:DecodeFromUtf8(struct,struct,byref,byref,bool):int
; Emitting BLENDED_CODE for X64 CPU with AVX - Unix
; optimized code
; rbp based frame
; fully interruptible
; Final local variable assignments
;
;  V00 arg0         [V00    ] (  7,  5   )  struct (16) [rbp-0x38]   do-not-enreg[XSFB] addr-exposed ld-addr-op
;  V01 arg1         [V01    ] (  4,  3   )  struct (16) [rbp-0x48]   do-not-enreg[XSFB] addr-exposed ld-addr-op
;  V02 arg2         [V02,T20] (  7,  4.50)   byref  ->   r8
;  V03 arg3         [V03,T21] (  7,  4.50)   byref  ->   r9
;  V04 arg4         [V04,T49] (  3,  1.50)    bool  ->  [rbp+0x10]
;  V05 loc0         [V05,T25] (  9,  4.50)    long  ->  rsi
;  V06 loc1         [V06    ] (  1,  0.50)   byref  ->  [rbp-0x50]   must-init pinned
;  V07 loc2         [V07,T26] (  8,  4   )    long  ->  rcx
;  V08 loc3         [V08    ] (  1,  0.50)   byref  ->  [rbp-0x58]   must-init pinned
;  V09 loc4         [V09,T28] (  7,  3.50)     int  ->  [rbp-0x5C]
;  V10 loc5         [V10,T29] (  6,  3   )     int  ->  [rbp-0x60]
;  V11 loc6         [V11,T24] ( 10,  5   )     int  ->  rax
;  V12 loc7         [V12,T50] (  3,  1.50)     int  ->  rbx
;  V13 loc8         [V13,T00] ( 24, 36.50)    long  ->  r14         ld-addr-op
;  V14 loc9         [V14,T01] ( 28, 31.50)    long  ->  r15         ld-addr-op
;  V15 loc10        [V15,T27] (  8,  4   )    long  ->  r12
;  V16 loc11        [V16,T22] (  6,  6.50)    long  ->  rdx
;  V17 loc12        [V17,T51] (  3,  1.50)     int  ->  [rbp-0x64]
;* V18 loc13        [V18,T59] (  0,  0   )   byref  ->  zero-ref
;  V19 loc14        [V19,T56] (  2,  1   )     int  ->  rax
;  V20 loc15        [V20,T57] (  2,  1   )     int  ->  rdi
;  V21 loc16        [V21,T32] (  4,  2   )     int  ->  rdx
;  V22 loc17        [V22,T52] (  3,  1.50)     int  ->  r11
;  V23 loc18        [V23,T10] ( 20, 10   )     int  ->  rax
;  V24 loc19        [V24,T33] (  4,  2   )     int  ->  rdi
;  V25 loc20        [V25,T34] (  4,  2   )    long  ->  rdi
;  V26 loc21        [V26,T23] (  6,  6   )    long  ->  rax
;  V27 loc22        [V27,T03] (  5, 20   )     int  ->  r10
;  V28 loc23        [V28,T35] (  4,  2   )     int  ->  rdx
;  V29 loc24        [V29,T58] (  2,  1   )     int  ->  r11
;  V30 loc25        [V30,T36] (  4,  2   )     int  ->  rdx
;# V31 OutArgs      [V31    ] (  1,  1   )  lclBlk ( 0) [rsp+0x00]   "OutgoingArgSpace"
;  V32 tmp1         [V32,T53] (  3,  1.50)     int  ->  r13
;* V33 tmp2         [V33    ] (  0,  0   )  struct (16) zero-ref    "struct address for call/obj"
;* V34 tmp3         [V34    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op "Inlining Arg"
;* V35 tmp4         [V35    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op "Inlining Arg"
;* V36 tmp5         [V36    ] (  0,  0   )  struct (16) zero-ref    "struct address for call/obj"
;  V37 tmp6         [V37,T98] (  2,  2.50)  simd32  ->  mm0         "Inline stloc first use temp"
;* V38 tmp7         [V38    ] (  0,  0   )  struct (16) zero-ref    "struct address for call/obj"
;  V39 tmp8         [V39,T99] (  2,  2.50)  simd32  ->  mm1         "Inline stloc first use temp"
;* V40 tmp9         [V40    ] (  0,  0   )  struct (16) zero-ref    "struct address for call/obj"
;  V41 tmp10        [V41,T100] (  2,  2.50)  simd32  ->  mm2         "Inline stloc first use temp"
;* V42 tmp11        [V42    ] (  0,  0   )  struct (16) zero-ref    "struct address for call/obj"
;  V43 tmp12        [V43,T86] (  4,  6.50)  simd32  ->  mm3         "Inline stloc first use temp"
;* V44 tmp13        [V44    ] (  0,  0   )  simd32  ->  zero-ref    "struct address for call/obj"
;  V45 tmp14        [V45,T101] (  2,  2.50)  simd32  ->  mm4         "Inline stloc first use temp"
;* V46 tmp15        [V46    ] (  0,  0   )  simd32  ->  zero-ref    "struct address for call/obj"
;  V47 tmp16        [V47,T102] (  2,  2.50)  simd32  ->  mm5         "Inline stloc first use temp"
;* V48 tmp17        [V48    ] (  0,  0   )  struct (16) zero-ref    "struct address for call/obj"
;  V49 tmp18        [V49,T103] (  2,  2.50)  simd32  ->  mm6         "Inline stloc first use temp"
;* V50 tmp19        [V50    ] (  0,  0   )  struct (16) zero-ref    "struct address for call/obj"
;  V51 tmp20        [V51,T112] (  2,  2   )  simd32  ->  mm7         "struct address for call/obj"
;  V52 tmp21        [V52,T104] (  2,  2.50)  simd32  ->  mm7         "Inline stloc first use temp"
;  V53 tmp22        [V53,T11] (  6,  9   )    long  ->  r14         "Inline stloc first use temp"
;  V54 tmp23        [V54,T18] (  5,  7   )    long  ->  r15         "Inline stloc first use temp"
;  V55 tmp24        [V55,T84] (  9, 18   )  simd32  ->  mm8         "Inline stloc first use temp"
;  V56 tmp25        [V56,T88] (  3,  6   )  simd32  ->  mm9         "Inline stloc first use temp"
;  V57 tmp26        [V57,T90] (  2,  4   )  simd32  ->  mm10         "Inline stloc first use temp"
;  V58 tmp27        [V58,T91] (  2,  4   )  simd32  ->  mm11         "Inline stloc first use temp"
;  V59 tmp28        [V59,T92] (  2,  4   )  simd32  ->  mm10         "Inline stloc first use temp"
;  V60 tmp29        [V60,T93] (  2,  4   )  simd32  ->  mm9         "Inline stloc first use temp"
;* V61 tmp30        [V61    ] (  0,  0   )  struct (16) zero-ref    "NewObj constructor temp"
;* V62 tmp31        [V62    ] (  0,  0   )  struct ( 8) zero-ref    "NewObj constructor temp"
;* V63 tmp32        [V63    ] (  0,  0   )  struct (16) zero-ref    "Inlining Arg"
;* V64 tmp33        [V64    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op "Inlining Arg"
;* V65 tmp34        [V65    ] (  0,  0   )   byref  ->  zero-ref    "Inlining Arg"
;* V66 tmp35        [V66    ] (  0,  0   )  struct (16) zero-ref    "NewObj constructor temp"
;* V67 tmp36        [V67    ] (  0,  0   )  struct ( 8) zero-ref    "NewObj constructor temp"
;* V68 tmp37        [V68    ] (  0,  0   )  struct (16) zero-ref    "Inlining Arg"
;* V69 tmp38        [V69    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op "Inlining Arg"
;* V70 tmp39        [V70    ] (  0,  0   )   byref  ->  zero-ref    "Inlining Arg"
;* V71 tmp40        [V71    ] (  0,  0   )  struct (16) zero-ref    "NewObj constructor temp"
;* V72 tmp41        [V72    ] (  0,  0   )  struct ( 8) zero-ref    "NewObj constructor temp"
;* V73 tmp42        [V73    ] (  0,  0   )  struct (16) zero-ref    "Inlining Arg"
;* V74 tmp43        [V74    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op "Inlining Arg"
;* V75 tmp44        [V75    ] (  0,  0   )   byref  ->  zero-ref    "Inlining Arg"
;* V76 tmp45        [V76    ] (  0,  0   )  struct (16) zero-ref    "NewObj constructor temp"
;* V77 tmp46        [V77    ] (  0,  0   )  struct ( 8) zero-ref    "NewObj constructor temp"
;* V78 tmp47        [V78    ] (  0,  0   )  struct (16) zero-ref    "Inlining Arg"
;* V79 tmp48        [V79    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op "Inlining Arg"
;* V80 tmp49        [V80    ] (  0,  0   )   byref  ->  zero-ref    "Inlining Arg"
;  V81 tmp50        [V81,T113] (  2,  1   )  simd32  ->  mm4         "Inline return value spill temp"
;  V82 tmp51        [V82,T114] (  2,  1   )  simd16  ->  mm4         "Inline stloc first use temp"
;  V83 tmp52        [V83,T115] (  2,  1   )  simd32  ->  mm5         "Inline return value spill temp"
;  V84 tmp53        [V84,T116] (  2,  1   )  simd16  ->  mm5         "Inline stloc first use temp"
;* V85 tmp54        [V85    ] (  0,  0   )  struct (16) zero-ref    "NewObj constructor temp"
;* V86 tmp55        [V86    ] (  0,  0   )  struct ( 8) zero-ref    "NewObj constructor temp"
;* V87 tmp56        [V87    ] (  0,  0   )  struct (16) zero-ref    "Inlining Arg"
;* V88 tmp57        [V88    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op "Inlining Arg"
;* V89 tmp58        [V89    ] (  0,  0   )   byref  ->  zero-ref    "Inlining Arg"
;* V90 tmp59        [V90    ] (  0,  0   )  struct (16) zero-ref    "NewObj constructor temp"
;* V91 tmp60        [V91    ] (  0,  0   )  struct ( 8) zero-ref    "NewObj constructor temp"
;* V92 tmp61        [V92    ] (  0,  0   )  struct (16) zero-ref    "Inlining Arg"
;* V93 tmp62        [V93    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op "Inlining Arg"
;* V94 tmp63        [V94    ] (  0,  0   )   byref  ->  zero-ref    "Inlining Arg"
;* V95 tmp64        [V95    ] (  0,  0   )  struct (16) zero-ref    "struct address for call/obj"
;  V96 tmp65        [V96,T105] (  2,  2.50)  simd16  ->  mm0         "Inline stloc first use temp"
;* V97 tmp66        [V97    ] (  0,  0   )  struct (16) zero-ref    "struct address for call/obj"
;  V98 tmp67        [V98,T106] (  2,  2.50)  simd16  ->  mm1         "Inline stloc first use temp"
;* V99 tmp68        [V99    ] (  0,  0   )  struct (16) zero-ref    "struct address for call/obj"
;  V100 tmp69       [V100,T107] (  2,  2.50)  simd16  ->  mm2         "Inline stloc first use temp"
;* V101 tmp70       [V101    ] (  0,  0   )  struct (16) zero-ref    "struct address for call/obj"
;  V102 tmp71       [V102,T87] (  4,  6.50)  simd16  ->  mm3         "Inline stloc first use temp"
;* V103 tmp72       [V103    ] (  0,  0   )  simd16  ->  zero-ref    "struct address for call/obj"
;  V104 tmp73       [V104,T108] (  2,  2.50)  simd16  ->  mm4         "Inline stloc first use temp"
;* V105 tmp74       [V105    ] (  0,  0   )  simd16  ->  zero-ref    "struct address for call/obj"
;  V106 tmp75       [V106,T109] (  2,  2.50)  simd16  ->  mm5         "Inline stloc first use temp"
;* V107 tmp76       [V107    ] (  0,  0   )  struct (16) zero-ref    "struct address for call/obj"
;  V108 tmp77       [V108,T110] (  2,  2.50)  simd16  ->  mm6         "Inline stloc first use temp"
;  V109 tmp78       [V109,T111] (  2,  2.50)  simd16  ->  mm7         "Inline stloc first use temp"
;  V110 tmp79       [V110,T12] (  6,  9   )    long  ->  r14         "Inline stloc first use temp"
;  V111 tmp80       [V111,T19] (  5,  7   )    long  ->  r15         "Inline stloc first use temp"
;  V112 tmp81       [V112,T85] (  9, 18   )  simd16  ->  mm8         "Inline stloc first use temp"
;  V113 tmp82       [V113,T89] (  3,  6   )  simd16  ->  mm9         "Inline stloc first use temp"
;  V114 tmp83       [V114,T94] (  2,  4   )  simd16  ->  mm10         "Inline stloc first use temp"
;  V115 tmp84       [V115,T95] (  2,  4   )  simd16  ->  mm11         "Inline stloc first use temp"
;  V116 tmp85       [V116,T96] (  2,  4   )  simd16  ->  mm10         "Inline stloc first use temp"
;  V117 tmp86       [V117,T97] (  2,  4   )  simd16  ->  mm9         "Inline stloc first use temp"
;* V118 tmp87       [V118    ] (  0,  0   )  struct (16) zero-ref    "NewObj constructor temp"
;* V119 tmp88       [V119    ] (  0,  0   )  struct ( 8) zero-ref    "NewObj constructor temp"
;* V120 tmp89       [V120    ] (  0,  0   )  struct (16) zero-ref    "Inlining Arg"
;* V121 tmp90       [V121    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op "Inlining Arg"
;* V122 tmp91       [V122    ] (  0,  0   )   byref  ->  zero-ref    "Inlining Arg"
;* V123 tmp92       [V123    ] (  0,  0   )  struct (16) zero-ref    "NewObj constructor temp"
;* V124 tmp93       [V124    ] (  0,  0   )  struct ( 8) zero-ref    "NewObj constructor temp"
;* V125 tmp94       [V125    ] (  0,  0   )  struct (16) zero-ref    "Inlining Arg"
;* V126 tmp95       [V126    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op "Inlining Arg"
;* V127 tmp96       [V127    ] (  0,  0   )   byref  ->  zero-ref    "Inlining Arg"
;* V128 tmp97       [V128    ] (  0,  0   )  struct (16) zero-ref    "NewObj constructor temp"
;* V129 tmp98       [V129    ] (  0,  0   )  struct ( 8) zero-ref    "NewObj constructor temp"
;* V130 tmp99       [V130    ] (  0,  0   )  struct (16) zero-ref    "Inlining Arg"
;* V131 tmp100      [V131    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op "Inlining Arg"
;* V132 tmp101      [V132    ] (  0,  0   )   byref  ->  zero-ref    "Inlining Arg"
;* V133 tmp102      [V133    ] (  0,  0   )  struct (16) zero-ref    "NewObj constructor temp"
;* V134 tmp103      [V134    ] (  0,  0   )  struct ( 8) zero-ref    "NewObj constructor temp"
;* V135 tmp104      [V135    ] (  0,  0   )  struct (16) zero-ref    "Inlining Arg"
;* V136 tmp105      [V136    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op "Inlining Arg"
;* V137 tmp106      [V137    ] (  0,  0   )   byref  ->  zero-ref    "Inlining Arg"
;  V138 tmp107      [V138,T117] (  2,  1   )  simd16  ->  mm4         "Inline return value spill temp"
;  V139 tmp108      [V139,T118] (  2,  1   )  simd16  ->  mm4         "Inline stloc first use temp"
;  V140 tmp109      [V140,T119] (  2,  1   )  simd16  ->  mm5         "Inline return value spill temp"
;  V141 tmp110      [V141,T120] (  2,  1   )  simd16  ->  mm5         "Inline stloc first use temp"
;* V142 tmp111      [V142    ] (  0,  0   )  struct (16) zero-ref    "NewObj constructor temp"
;* V143 tmp112      [V143    ] (  0,  0   )  struct ( 8) zero-ref    "NewObj constructor temp"
;* V144 tmp113      [V144    ] (  0,  0   )  struct (16) zero-ref    "Inlining Arg"
;* V145 tmp114      [V145    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op "Inlining Arg"
;* V146 tmp115      [V146    ] (  0,  0   )   byref  ->  zero-ref    "Inlining Arg"
;* V147 tmp116      [V147    ] (  0,  0   )  struct (16) zero-ref    "NewObj constructor temp"
;* V148 tmp117      [V148    ] (  0,  0   )  struct ( 8) zero-ref    "NewObj constructor temp"
;* V149 tmp118      [V149    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op "Inlining Arg"
;* V150 tmp119      [V150    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;  V151 tmp120      [V151,T13] (  2,  8   )     int  ->  rbx         "Inline stloc first use temp"
;  V152 tmp121      [V152,T14] (  2,  8   )     int  ->  rdi         "Inline stloc first use temp"
;  V153 tmp122      [V153,T15] (  2,  8   )     int  ->  r13         "Inline stloc first use temp"
;  V154 tmp123      [V154,T16] (  2,  8   )     int  ->  r11         "Inline stloc first use temp"
;  V155 tmp124      [V155,T05] (  2, 16   )     int  ->  r10         "impAppendStmt"
;  V156 tmp125      [V156,T02] (  6, 24   )     int  ->  rdi         "Inline stloc first use temp"
;  V157 tmp126      [V157,T04] (  4, 16   )     int  ->  rbx         "Inline stloc first use temp"
;  V158 tmp127      [V158,T17] (  2,  8   )     int  ->  r11         "Inline stloc first use temp"
;* V159 tmp128      [V159    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;  V160 tmp129      [V160,T06] (  2, 16   )    long  ->  rbx         "NewObj constructor temp"
;* V161 tmp130      [V161    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;* V162 tmp131      [V162    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;  V163 tmp132      [V163,T07] (  2, 16   )    long  ->  rdi         "NewObj constructor temp"
;* V164 tmp133      [V164    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;* V165 tmp134      [V165    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;  V166 tmp135      [V166,T08] (  2, 16   )    long  ->  rbx         "NewObj constructor temp"
;* V167 tmp136      [V167    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;* V168 tmp137      [V168    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;  V169 tmp138      [V169,T09] (  2, 16   )    long  ->  r11         "NewObj constructor temp"
;* V170 tmp139      [V170    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;* V171 tmp140      [V171    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;* V172 tmp141      [V172    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;  V173 tmp142      [V173,T39] (  2,  2   )    long  ->  rax         "NewObj constructor temp"
;* V174 tmp143      [V174    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;* V175 tmp144      [V175    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;  V176 tmp145      [V176,T40] (  2,  2   )    long  ->  rdi         "NewObj constructor temp"
;* V177 tmp146      [V177    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;* V178 tmp147      [V178    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;  V179 tmp148      [V179,T41] (  2,  2   )    long  ->  rdx         "NewObj constructor temp"
;* V180 tmp149      [V180    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;* V181 tmp150      [V181    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;  V182 tmp151      [V182,T42] (  2,  2   )    long  ->  r11         "NewObj constructor temp"
;* V183 tmp152      [V183    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;* V184 tmp153      [V184    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;* V185 tmp154      [V185    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;  V186 tmp155      [V186,T43] (  2,  2   )    long  ->  rdx         "NewObj constructor temp"
;* V187 tmp156      [V187    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;  V188 tmp157      [V188,T44] (  2,  2   )     int  ->  rax         "Single return block return value"
;* V189 tmp158      [V189,T54] (  0,  0   )   byref  ->  zero-ref    V33._pointer(offs=0x00) P-INDEP "field V33._pointer (fldOffset=0x0)"
;* V190 tmp159      [V190    ] (  0,  0   )     int  ->  zero-ref    V33._length(offs=0x08) P-INDEP "field V33._length (fldOffset=0x8)"
;  V191 tmp160      [V191,T47] (  3,  1.50)   byref  ->  rsi         V34._pointer(offs=0x00) P-INDEP "field V34._pointer (fldOffset=0x0)"
;* V192 tmp161      [V192    ] (  0,  0   )     int  ->  zero-ref    V34._length(offs=0x08) P-INDEP "field V34._length (fldOffset=0x8)"
;  V193 tmp162      [V193,T48] (  3,  1.50)   byref  ->  rcx         V35._pointer(offs=0x00) P-INDEP "field V35._pointer (fldOffset=0x0)"
;* V194 tmp163      [V194    ] (  0,  0   )     int  ->  zero-ref    V35._length(offs=0x08) P-INDEP "field V35._length (fldOffset=0x8)"
;* V195 tmp164      [V195    ] (  0,  0   )   byref  ->  zero-ref    V36._pointer(offs=0x00) P-INDEP "field V36._pointer (fldOffset=0x0)"
;* V196 tmp165      [V196    ] (  0,  0   )     int  ->  zero-ref    V36._length(offs=0x08) P-INDEP "field V36._length (fldOffset=0x8)"
;* V197 tmp166      [V197    ] (  0,  0   )   byref  ->  zero-ref    V38._pointer(offs=0x00) P-INDEP "field V38._pointer (fldOffset=0x0)"
;* V198 tmp167      [V198    ] (  0,  0   )     int  ->  zero-ref    V38._length(offs=0x08) P-INDEP "field V38._length (fldOffset=0x8)"
;* V199 tmp168      [V199    ] (  0,  0   )   byref  ->  zero-ref    V40._pointer(offs=0x00) P-INDEP "field V40._pointer (fldOffset=0x0)"
;* V200 tmp169      [V200    ] (  0,  0   )     int  ->  zero-ref    V40._length(offs=0x08) P-INDEP "field V40._length (fldOffset=0x8)"
;* V201 tmp170      [V201    ] (  0,  0   )   byref  ->  zero-ref    V42._pointer(offs=0x00) P-INDEP "field V42._pointer (fldOffset=0x0)"
;* V202 tmp171      [V202    ] (  0,  0   )     int  ->  zero-ref    V42._length(offs=0x08) P-INDEP "field V42._length (fldOffset=0x8)"
;* V203 tmp172      [V203    ] (  0,  0   )   byref  ->  zero-ref    V48._pointer(offs=0x00) P-INDEP "field V48._pointer (fldOffset=0x0)"
;* V204 tmp173      [V204    ] (  0,  0   )     int  ->  zero-ref    V48._length(offs=0x08) P-INDEP "field V48._length (fldOffset=0x8)"
;* V205 tmp174      [V205    ] (  0,  0   )   byref  ->  zero-ref    V50._pointer(offs=0x00) P-INDEP "field V50._pointer (fldOffset=0x0)"
;* V206 tmp175      [V206    ] (  0,  0   )     int  ->  zero-ref    V50._length(offs=0x08) P-INDEP "field V50._length (fldOffset=0x8)"
;* V207 tmp176      [V207,T60] (  0,  0   )   byref  ->  zero-ref    V61._pointer(offs=0x00) P-INDEP "field V61._pointer (fldOffset=0x0)"
;* V208 tmp177      [V208    ] (  0,  0   )     int  ->  zero-ref    V61._length(offs=0x08) P-INDEP "field V61._length (fldOffset=0x8)"
;* V209 tmp178      [V209,T61] (  0,  0   )   byref  ->  zero-ref    V62._value(offs=0x00) P-INDEP "field V62._value (fldOffset=0x0)"
;* V210 tmp179      [V210    ] (  0,  0   )   byref  ->  zero-ref    V63._pointer(offs=0x00) P-INDEP "field V63._pointer (fldOffset=0x0)"
;* V211 tmp180      [V211    ] (  0,  0   )     int  ->  zero-ref    V63._length(offs=0x08) P-INDEP "field V63._length (fldOffset=0x8)"
;* V212 tmp181      [V212    ] (  0,  0   )   byref  ->  zero-ref    V64._pointer(offs=0x00) P-INDEP "field V64._pointer (fldOffset=0x0)"
;* V213 tmp182      [V213    ] (  0,  0   )     int  ->  zero-ref    V64._length(offs=0x08) P-INDEP "field V64._length (fldOffset=0x8)"
;* V214 tmp183      [V214,T62] (  0,  0   )   byref  ->  zero-ref    V66._pointer(offs=0x00) P-INDEP "field V66._pointer (fldOffset=0x0)"
;* V215 tmp184      [V215    ] (  0,  0   )     int  ->  zero-ref    V66._length(offs=0x08) P-INDEP "field V66._length (fldOffset=0x8)"
;* V216 tmp185      [V216,T63] (  0,  0   )   byref  ->  zero-ref    V67._value(offs=0x00) P-INDEP "field V67._value (fldOffset=0x0)"
;* V217 tmp186      [V217    ] (  0,  0   )   byref  ->  zero-ref    V68._pointer(offs=0x00) P-INDEP "field V68._pointer (fldOffset=0x0)"
;* V218 tmp187      [V218    ] (  0,  0   )     int  ->  zero-ref    V68._length(offs=0x08) P-INDEP "field V68._length (fldOffset=0x8)"
;* V219 tmp188      [V219    ] (  0,  0   )   byref  ->  zero-ref    V69._pointer(offs=0x00) P-INDEP "field V69._pointer (fldOffset=0x0)"
;* V220 tmp189      [V220    ] (  0,  0   )     int  ->  zero-ref    V69._length(offs=0x08) P-INDEP "field V69._length (fldOffset=0x8)"
;* V221 tmp190      [V221,T64] (  0,  0   )   byref  ->  zero-ref    V71._pointer(offs=0x00) P-INDEP "field V71._pointer (fldOffset=0x0)"
;* V222 tmp191      [V222    ] (  0,  0   )     int  ->  zero-ref    V71._length(offs=0x08) P-INDEP "field V71._length (fldOffset=0x8)"
;* V223 tmp192      [V223,T65] (  0,  0   )   byref  ->  zero-ref    V72._value(offs=0x00) P-INDEP "field V72._value (fldOffset=0x0)"
;* V224 tmp193      [V224    ] (  0,  0   )   byref  ->  zero-ref    V73._pointer(offs=0x00) P-INDEP "field V73._pointer (fldOffset=0x0)"
;* V225 tmp194      [V225    ] (  0,  0   )     int  ->  zero-ref    V73._length(offs=0x08) P-INDEP "field V73._length (fldOffset=0x8)"
;* V226 tmp195      [V226    ] (  0,  0   )   byref  ->  zero-ref    V74._pointer(offs=0x00) P-INDEP "field V74._pointer (fldOffset=0x0)"
;* V227 tmp196      [V227    ] (  0,  0   )     int  ->  zero-ref    V74._length(offs=0x08) P-INDEP "field V74._length (fldOffset=0x8)"
;* V228 tmp197      [V228,T66] (  0,  0   )   byref  ->  zero-ref    V76._pointer(offs=0x00) P-INDEP "field V76._pointer (fldOffset=0x0)"
;* V229 tmp198      [V229    ] (  0,  0   )     int  ->  zero-ref    V76._length(offs=0x08) P-INDEP "field V76._length (fldOffset=0x8)"
;* V230 tmp199      [V230,T67] (  0,  0   )   byref  ->  zero-ref    V77._value(offs=0x00) P-INDEP "field V77._value (fldOffset=0x0)"
;* V231 tmp200      [V231    ] (  0,  0   )   byref  ->  zero-ref    V78._pointer(offs=0x00) P-INDEP "field V78._pointer (fldOffset=0x0)"
;* V232 tmp201      [V232    ] (  0,  0   )     int  ->  zero-ref    V78._length(offs=0x08) P-INDEP "field V78._length (fldOffset=0x8)"
;* V233 tmp202      [V233    ] (  0,  0   )   byref  ->  zero-ref    V79._pointer(offs=0x00) P-INDEP "field V79._pointer (fldOffset=0x0)"
;* V234 tmp203      [V234    ] (  0,  0   )     int  ->  zero-ref    V79._length(offs=0x08) P-INDEP "field V79._length (fldOffset=0x8)"
;* V235 tmp204      [V235,T68] (  0,  0   )   byref  ->  zero-ref    V85._pointer(offs=0x00) P-INDEP "field V85._pointer (fldOffset=0x0)"
;* V236 tmp205      [V236    ] (  0,  0   )     int  ->  zero-ref    V85._length(offs=0x08) P-INDEP "field V85._length (fldOffset=0x8)"
;* V237 tmp206      [V237,T69] (  0,  0   )   byref  ->  zero-ref    V86._value(offs=0x00) P-INDEP "field V86._value (fldOffset=0x0)"
;* V238 tmp207      [V238    ] (  0,  0   )   byref  ->  zero-ref    V87._pointer(offs=0x00) P-INDEP "field V87._pointer (fldOffset=0x0)"
;* V239 tmp208      [V239    ] (  0,  0   )     int  ->  zero-ref    V87._length(offs=0x08) P-INDEP "field V87._length (fldOffset=0x8)"
;* V240 tmp209      [V240    ] (  0,  0   )   byref  ->  zero-ref    V88._pointer(offs=0x00) P-INDEP "field V88._pointer (fldOffset=0x0)"
;* V241 tmp210      [V241    ] (  0,  0   )     int  ->  zero-ref    V88._length(offs=0x08) P-INDEP "field V88._length (fldOffset=0x8)"
;* V242 tmp211      [V242,T70] (  0,  0   )   byref  ->  zero-ref    V90._pointer(offs=0x00) P-INDEP "field V90._pointer (fldOffset=0x0)"
;* V243 tmp212      [V243    ] (  0,  0   )     int  ->  zero-ref    V90._length(offs=0x08) P-INDEP "field V90._length (fldOffset=0x8)"
;* V244 tmp213      [V244,T71] (  0,  0   )   byref  ->  zero-ref    V91._value(offs=0x00) P-INDEP "field V91._value (fldOffset=0x0)"
;* V245 tmp214      [V245    ] (  0,  0   )   byref  ->  zero-ref    V92._pointer(offs=0x00) P-INDEP "field V92._pointer (fldOffset=0x0)"
;* V246 tmp215      [V246    ] (  0,  0   )     int  ->  zero-ref    V92._length(offs=0x08) P-INDEP "field V92._length (fldOffset=0x8)"
;* V247 tmp216      [V247    ] (  0,  0   )   byref  ->  zero-ref    V93._pointer(offs=0x00) P-INDEP "field V93._pointer (fldOffset=0x0)"
;* V248 tmp217      [V248    ] (  0,  0   )     int  ->  zero-ref    V93._length(offs=0x08) P-INDEP "field V93._length (fldOffset=0x8)"
;* V249 tmp218      [V249    ] (  0,  0   )   byref  ->  zero-ref    V95._pointer(offs=0x00) P-INDEP "field V95._pointer (fldOffset=0x0)"
;* V250 tmp219      [V250    ] (  0,  0   )     int  ->  zero-ref    V95._length(offs=0x08) P-INDEP "field V95._length (fldOffset=0x8)"
;* V251 tmp220      [V251    ] (  0,  0   )   byref  ->  zero-ref    V97._pointer(offs=0x00) P-INDEP "field V97._pointer (fldOffset=0x0)"
;* V252 tmp221      [V252    ] (  0,  0   )     int  ->  zero-ref    V97._length(offs=0x08) P-INDEP "field V97._length (fldOffset=0x8)"
;* V253 tmp222      [V253    ] (  0,  0   )   byref  ->  zero-ref    V99._pointer(offs=0x00) P-INDEP "field V99._pointer (fldOffset=0x0)"
;* V254 tmp223      [V254    ] (  0,  0   )     int  ->  zero-ref    V99._length(offs=0x08) P-INDEP "field V99._length (fldOffset=0x8)"
;* V255 tmp224      [V255    ] (  0,  0   )   byref  ->  zero-ref    V101._pointer(offs=0x00) P-INDEP "field V101._pointer (fldOffset=0x0)"
;* V256 tmp225      [V256    ] (  0,  0   )     int  ->  zero-ref    V101._length(offs=0x08) P-INDEP "field V101._length (fldOffset=0x8)"
;* V257 tmp226      [V257    ] (  0,  0   )   byref  ->  zero-ref    V107._pointer(offs=0x00) P-INDEP "field V107._pointer (fldOffset=0x0)"
;* V258 tmp227      [V258    ] (  0,  0   )     int  ->  zero-ref    V107._length(offs=0x08) P-INDEP "field V107._length (fldOffset=0x8)"
;* V259 tmp228      [V259,T72] (  0,  0   )   byref  ->  zero-ref    V118._pointer(offs=0x00) P-INDEP "field V118._pointer (fldOffset=0x0)"
;* V260 tmp229      [V260    ] (  0,  0   )     int  ->  zero-ref    V118._length(offs=0x08) P-INDEP "field V118._length (fldOffset=0x8)"
;* V261 tmp230      [V261,T73] (  0,  0   )   byref  ->  zero-ref    V119._value(offs=0x00) P-INDEP "field V119._value (fldOffset=0x0)"
;* V262 tmp231      [V262    ] (  0,  0   )   byref  ->  zero-ref    V120._pointer(offs=0x00) P-INDEP "field V120._pointer (fldOffset=0x0)"
;* V263 tmp232      [V263    ] (  0,  0   )     int  ->  zero-ref    V120._length(offs=0x08) P-INDEP "field V120._length (fldOffset=0x8)"
;* V264 tmp233      [V264    ] (  0,  0   )   byref  ->  zero-ref    V121._pointer(offs=0x00) P-INDEP "field V121._pointer (fldOffset=0x0)"
;* V265 tmp234      [V265    ] (  0,  0   )     int  ->  zero-ref    V121._length(offs=0x08) P-INDEP "field V121._length (fldOffset=0x8)"
;* V266 tmp235      [V266,T74] (  0,  0   )   byref  ->  zero-ref    V123._pointer(offs=0x00) P-INDEP "field V123._pointer (fldOffset=0x0)"
;* V267 tmp236      [V267    ] (  0,  0   )     int  ->  zero-ref    V123._length(offs=0x08) P-INDEP "field V123._length (fldOffset=0x8)"
;* V268 tmp237      [V268,T75] (  0,  0   )   byref  ->  zero-ref    V124._value(offs=0x00) P-INDEP "field V124._value (fldOffset=0x0)"
;* V269 tmp238      [V269    ] (  0,  0   )   byref  ->  zero-ref    V125._pointer(offs=0x00) P-INDEP "field V125._pointer (fldOffset=0x0)"
;* V270 tmp239      [V270    ] (  0,  0   )     int  ->  zero-ref    V125._length(offs=0x08) P-INDEP "field V125._length (fldOffset=0x8)"
;* V271 tmp240      [V271    ] (  0,  0   )   byref  ->  zero-ref    V126._pointer(offs=0x00) P-INDEP "field V126._pointer (fldOffset=0x0)"
;* V272 tmp241      [V272    ] (  0,  0   )     int  ->  zero-ref    V126._length(offs=0x08) P-INDEP "field V126._length (fldOffset=0x8)"
;* V273 tmp242      [V273,T76] (  0,  0   )   byref  ->  zero-ref    V128._pointer(offs=0x00) P-INDEP "field V128._pointer (fldOffset=0x0)"
;* V274 tmp243      [V274    ] (  0,  0   )     int  ->  zero-ref    V128._length(offs=0x08) P-INDEP "field V128._length (fldOffset=0x8)"
;* V275 tmp244      [V275,T77] (  0,  0   )   byref  ->  zero-ref    V129._value(offs=0x00) P-INDEP "field V129._value (fldOffset=0x0)"
;* V276 tmp245      [V276    ] (  0,  0   )   byref  ->  zero-ref    V130._pointer(offs=0x00) P-INDEP "field V130._pointer (fldOffset=0x0)"
;* V277 tmp246      [V277    ] (  0,  0   )     int  ->  zero-ref    V130._length(offs=0x08) P-INDEP "field V130._length (fldOffset=0x8)"
;* V278 tmp247      [V278    ] (  0,  0   )   byref  ->  zero-ref    V131._pointer(offs=0x00) P-INDEP "field V131._pointer (fldOffset=0x0)"
;* V279 tmp248      [V279    ] (  0,  0   )     int  ->  zero-ref    V131._length(offs=0x08) P-INDEP "field V131._length (fldOffset=0x8)"
;* V280 tmp249      [V280,T78] (  0,  0   )   byref  ->  zero-ref    V133._pointer(offs=0x00) P-INDEP "field V133._pointer (fldOffset=0x0)"
;* V281 tmp250      [V281    ] (  0,  0   )     int  ->  zero-ref    V133._length(offs=0x08) P-INDEP "field V133._length (fldOffset=0x8)"
;* V282 tmp251      [V282,T79] (  0,  0   )   byref  ->  zero-ref    V134._value(offs=0x00) P-INDEP "field V134._value (fldOffset=0x0)"
;* V283 tmp252      [V283    ] (  0,  0   )   byref  ->  zero-ref    V135._pointer(offs=0x00) P-INDEP "field V135._pointer (fldOffset=0x0)"
;* V284 tmp253      [V284    ] (  0,  0   )     int  ->  zero-ref    V135._length(offs=0x08) P-INDEP "field V135._length (fldOffset=0x8)"
;* V285 tmp254      [V285    ] (  0,  0   )   byref  ->  zero-ref    V136._pointer(offs=0x00) P-INDEP "field V136._pointer (fldOffset=0x0)"
;* V286 tmp255      [V286    ] (  0,  0   )     int  ->  zero-ref    V136._length(offs=0x08) P-INDEP "field V136._length (fldOffset=0x8)"
;* V287 tmp256      [V287,T80] (  0,  0   )   byref  ->  zero-ref    V142._pointer(offs=0x00) P-INDEP "field V142._pointer (fldOffset=0x0)"
;* V288 tmp257      [V288    ] (  0,  0   )     int  ->  zero-ref    V142._length(offs=0x08) P-INDEP "field V142._length (fldOffset=0x8)"
;* V289 tmp258      [V289,T81] (  0,  0   )   byref  ->  zero-ref    V143._value(offs=0x00) P-INDEP "field V143._value (fldOffset=0x0)"
;* V290 tmp259      [V290    ] (  0,  0   )   byref  ->  zero-ref    V144._pointer(offs=0x00) P-INDEP "field V144._pointer (fldOffset=0x0)"
;* V291 tmp260      [V291    ] (  0,  0   )     int  ->  zero-ref    V144._length(offs=0x08) P-INDEP "field V144._length (fldOffset=0x8)"
;* V292 tmp261      [V292    ] (  0,  0   )   byref  ->  zero-ref    V145._pointer(offs=0x00) P-INDEP "field V145._pointer (fldOffset=0x0)"
;* V293 tmp262      [V293    ] (  0,  0   )     int  ->  zero-ref    V145._length(offs=0x08) P-INDEP "field V145._length (fldOffset=0x8)"
;* V294 tmp263      [V294,T55] (  0,  0   )   byref  ->  zero-ref    V147._pointer(offs=0x00) P-INDEP "field V147._pointer (fldOffset=0x0)"
;* V295 tmp264      [V295    ] (  0,  0   )     int  ->  zero-ref    V147._length(offs=0x08) P-INDEP "field V147._length (fldOffset=0x8)"
;* V296 tmp265      [V296,T82] (  0,  0   )   byref  ->  zero-ref    V148._value(offs=0x00) P-INDEP "field V148._value (fldOffset=0x0)"
;* V297 tmp266      [V297,T83] (  0,  0   )   byref  ->  zero-ref    V149._pointer(offs=0x00) P-INDEP "field V149._pointer (fldOffset=0x0)"
;* V298 tmp267      [V298    ] (  0,  0   )     int  ->  zero-ref    V149._length(offs=0x08) P-INDEP "field V149._length (fldOffset=0x8)"
;  V299 tmp268      [V299,T37] (  2,  2   )   byref  ->  rax         "BlockOp address local"
;  V300 tmp269      [V300,T45] (  2,  2   )    long  ->  rsi         "Cast away GC"
;  V301 tmp270      [V301,T38] (  2,  2   )   byref  ->  rax         "BlockOp address local"
;  V302 tmp271      [V302,T46] (  2,  2   )    long  ->  rcx         "Cast away GC"
;  V303 rat0        [V303,T30] (  3,  3   )     int  ->  rdx         "ReplaceWithLclVar is creating a new local variable"
;  V304 rat1        [V304,T31] (  3,  3   )     int  ->  rdx         "ReplaceWithLclVar is creating a new local variable"
;
; Lcl frame size = 72

G_M25171_IG01:
       55                   push     rbp
       4157                 push     r15
       4156                 push     r14
       4155                 push     r13
       4154                 push     r12
       53                   push     rbx
       4883EC48             sub      rsp, 72
       C5F877               vzeroupper
       488D6C2470           lea      rbp, [rsp+70H]
       33C0                 xor      rax, rax
       488945B0             mov      qword ptr [rbp-50H], rax
       488945A8             mov      qword ptr [rbp-58H], rax
       48897DC8             mov      bword ptr [rbp-38H], rdi
       488975D0             mov      qword ptr [rbp-30H], rsi
       488955B8             mov      bword ptr [rbp-48H], rdx
       48894DC0             mov      qword ptr [rbp-40H], rcx
       8B7D10               mov      edi, dword ptr [rbp+10H]

G_M25171_IG02:
       837DD000             cmp      dword ptr [rbp-30H], 0
       770D                 ja       SHORT G_M25171_IG03
       33C0                 xor      eax, eax
       418900               mov      dword ptr [r8], eax
       418901               mov      dword ptr [r9], eax
       E982040000           jmp      G_M25171_IG23

G_M25171_IG03:
       488D45C8             lea      rax, bword ptr [rbp-38H]
       488B30               mov      rsi, bword ptr [rax]
       488975B0             mov      bword ptr [rbp-50H], rsi
       488D45B8             lea      rax, bword ptr [rbp-48H]
       488B08               mov      rcx, bword ptr [rax]
       48894DA8             mov      bword ptr [rbp-58H], rcx
       448B55D0             mov      r10d, dword ptr [rbp-30H]
       4183E2FC             and      r10d, -4
       448B5DC0             mov      r11d, dword ptr [rbp-40H]
       418BC2               mov      eax, r10d
       85C0                 test     eax, eax
       0F8CF9040000         jl       G_M25171_IG31

G_M25171_IG04:
       8BD0                 mov      edx, eax
       C1FA02               sar      edx, 2
       8D1C52               lea      ebx, [rdx+2*rdx]
       8D53FE               lea      edx, [rbx-2]
       443BDA               cmp      r11d, edx
       7D14                 jge      SHORT G_M25171_IG05
       BA56555555           mov      edx, 0x55555556
       8BC2                 mov      eax, edx
       41F7EB               imul     edx:eax, r11d
       8BC2                 mov      eax, edx
       C1E81F               shr      eax, 31
       03C2                 add      eax, edx
       C1E002               shl      eax, 2

G_M25171_IG05:
       4C8BF6               mov      r14, rsi
       4C8BF9               mov      r15, rcx
       458BE2               mov      r12d, r10d
       4D03E6               add      r12, r14
       8BD0                 mov      edx, eax
       4903D6               add      rdx, r14
       83F818               cmp      eax, 24
       0F8CF1010000         jl       G_M25171_IG11
       488D42D3             lea      rax, [rdx-45]
       483BC6               cmp      rax, rsi
       0F82F7000000         jb       G_M25171_IG08
       49BED7DD95C2D27F0000 mov      r14, 0x7FD2C295DDD7
       C4C17D1006           vmovupd  ymm0, ymmword ptr[r14]
       49BFE7DB95C2D27F0000 mov      r15, 0x7FD2C295DBE7
       C4C17D100F           vmovupd  ymm1, ymmword ptr[r15]
       49BE07DC95C2D27F0000 mov      r14, 0x7FD2C295DC07
       C4C17D1016           vmovupd  ymm2, ymmword ptr[r14]
       49BE87DD95C2D27F0000 mov      r14, 0x7FD2C295DD87
       C4C17D101E           vmovupd  ymm3, ymmword ptr[r14]
       41BE40014001         mov      r14d, 0x1400140
       C4C1796EE6           vmovd    xmm4, r14d
       C4E27D58E4           vpbroadcastd ymm4, ymm4
       41BE00100100         mov      r14d, 0x11000
       C4C1796EEE           vmovd    xmm5, r14d
       C4E27D58ED           vpbroadcastd ymm5, ymm5
       49BE5FDE95C2D27F0000 mov      r14, 0x7FD2C295DE5F
       C4C17D1036           vmovupd  ymm6, ymmword ptr[r14]
       49BEA7DE95C2D27F0000 mov      r14, 0x7FD2C295DEA7
       C4C17D103E           vmovupd  ymm7, ymmword ptr[r14]
       4C8BF6               mov      r14, rsi
       4C8BF9               mov      r15, rcx

G_M25171_IG06:
       C4417E6F06           vmovdqu  ymm8, ymmword ptr[r14]
       C4C13572D004         vpsrld   ymm9, ymm8, 4
       C535DBCB             vpand    ymm9, ymm9, ymm3
       C53DDBD3             vpand    ymm10, ymm8, ymm3
       C4427D00D9           vpshufb  ymm11, ymm0, ymm9
       C4427500D2           vpshufb  ymm10, ymm1, ymm10
       C4427D17D3           vptest   ymm10, ymm11
       410F94C5             sete     r13b
       450FB6ED             movzx    r13, r13b
       4585ED               test     r13d, r13d
       743D                 je       SHORT G_M25171_IG07
       C53D74D3             vpcmpeqb ymm10, ymm8, ymm3
       C4412DFCC9           vpaddb   ymm9, ymm10, ymm9
       C4426D00C9           vpshufb  ymm9, ymm2, ymm9
       C4413DFCC1           vpaddb   ymm8, ymm8, ymm9
       C4623D04C4           vpmaddubsw ymm8, ymm8, ymm4
       C53DF5C5             vpmaddwd ymm8, ymm8, ymm5
       C4623D00C6           vpshufb  ymm8, ymm8, ymm6
       C4424536C0           vpermd   ymm8, ymm7, ymm8
       C4417E7F07           vmovdqu  ymmword ptr[r15], ymm8
       4983C620             add      r14, 32
       4983C718             add      r15, 24
       4C3BF0               cmp      r14, rax
       7699                 jbe      SHORT G_M25171_IG06

G_M25171_IG07:
       4D3BF4               cmp      r14, r12
       0F8401030000         je       G_M25171_IG22

G_M25171_IG08:
       488D42E8             lea      rax, [rdx-24]
       493BC6               cmp      rax, r14
       0F82E0000000         jb       G_M25171_IG11
       48BAA7DD95C2D27F0000 mov      rdx, 0x7FD2C295DDA7
       C5F91002             vmovupd  xmm0, xmmword ptr [rdx]
       48BA2FDE95C2D27F0000 mov      rdx, 0x7FD2C295DE2F
       C5F9100A             vmovupd  xmm1, xmmword ptr [rdx]
       48BAFFDD95C2D27F0000 mov      rdx, 0x7FD2C295DDFF
       C5F91012             vmovupd  xmm2, xmmword ptr [rdx]
       48BAC7DD95C2D27F0000 mov      rdx, 0x7FD2C295DDC7
       C5F9101A             vmovupd  xmm3, xmmword ptr [rdx]
       BA40014001           mov      edx, 0x1400140
       C5F96EE2             vmovd    xmm4, edx
       C4E27958E4           vpbroadcastd xmm4, xmm4
       BA00100100           mov      edx, 0x11000
       C5F96EEA             vmovd    xmm5, edx
       C4E27958ED           vpbroadcastd xmm5, xmm5
       48BA4FDE95C2D27F0000 mov      rdx, 0x7FD2C295DE4F
       C5F91032             vmovupd  xmm6, xmmword ptr [rdx]
       C5C057FF             vxorps   xmm7, xmm7, xmm7

G_M25171_IG09:
       C4417A6F06           vmovdqu  xmm8, xmmword ptr [r14]
       C4C13172D004         vpsrld   xmm9, xmm8, 4
       C531DBCB             vpand    xmm9, xmm9, xmm3
       C539DBD3             vpand    xmm10, xmm8, xmm3
       C4427900D9           vpshufb  xmm11, xmm0, xmm9
       C4427100D2           vpshufb  xmm10, xmm1, xmm10
       C44129DBD3           vpand    xmm10, xmm10, xmm11
       C52964D7             vpcmpgtb xmm10, xmm10, xmm7
       C4C179D7D2           vpmovmskb edx, xmm10
       85D2                 test     edx, edx
       7538                 jne      SHORT G_M25171_IG10
       C53974D3             vpcmpeqb xmm10, xmm8, xmm3
       C44129FCC9           vpaddb   xmm9, xmm10, xmm9
       C4426900C9           vpshufb  xmm9, xmm2, xmm9
       C44139FCC1           vpaddb   xmm8, xmm8, xmm9
       C4623904C4           vpmaddubsw xmm8, xmm8, xmm4
       C539F5C5             vpmaddwd xmm8, xmm8, xmm5
       C4623900C6           vpshufb  xmm8, xmm8, xmm6
       C4417A7F07           vmovdqu  xmmword ptr [r15], xmm8
       4983C610             add      r14, 16
       4983C70C             add      r15, 12
       4C3BF0               cmp      r14, rax
       769E                 jbe      SHORT G_M25171_IG09

G_M25171_IG10:
       4D3BF4               cmp      r14, r12
       0F8414020000         je       G_M25171_IG22

G_M25171_IG11:
       897D10               mov      dword ptr [rbp+10H], edi
       4084FF               test     dil, dil
       7505                 jne      SHORT G_M25171_IG12
       4533ED               xor      r13d, r13d
       EB06                 jmp      SHORT G_M25171_IG13

G_M25171_IG12:
       41BD04000000         mov      r13d, 4

G_M25171_IG13:
       443BDB               cmp      r11d, ebx
       7C10                 jl       SHORT G_M25171_IG14
       448955A4             mov      dword ptr [rbp-5CH], r10d
       44896D9C             mov      dword ptr [rbp-64H], r13d
       418BC2               mov      eax, r10d
       412BC5               sub      eax, r13d
       EB24                 jmp      SHORT G_M25171_IG15

G_M25171_IG14:
       BA56555555           mov      edx, 0x55555556
       44895DA0             mov      dword ptr [rbp-60H], r11d
       8BC2                 mov      eax, edx
       41F7EB               imul     edx:eax, r11d
       8BC2                 mov      eax, edx
       C1E81F               shr      eax, 31
       03C2                 add      eax, edx
       C1E002               shl      eax, 2
       448955A4             mov      dword ptr [rbp-5CH], r10d
       44896D9C             mov      dword ptr [rbp-64H], r13d
       448B5DA0             mov      r11d, dword ptr [rbp-60H]

G_M25171_IG15:
       8BD0                 mov      edx, eax
       4803D6               add      rdx, rsi
       4C3BF2               cmp      r14, rdx
       44895DA0             mov      dword ptr [rbp-60H], r11d
       0F8392000000         jae      G_M25171_IG17

G_M25171_IG16:
       410FB61E             movzx    rbx, byte  ptr [r14]
       410FB67E01           movzx    rdi, byte  ptr [r14+1]
       450FB66E02           movzx    r13, byte  ptr [r14+2]
       450FB65E03           movzx    r11, byte  ptr [r14+3]
       8BDB                 mov      ebx, ebx
       49BA67DC95C2D27F0000 mov      r10, 0x7FD2C295DC67
       4E0FBE1413           movsx    r10, byte  ptr [rbx+r10]
       8BFF                 mov      edi, edi
       48BB67DC95C2D27F0000 mov      rbx, 0x7FD2C295DC67
       480FBE3C1F           movsx    rdi, byte  ptr [rdi+rbx]
       418BDD               mov      ebx, r13d
       49BD67DC95C2D27F0000 mov      r13, 0x7FD2C295DC67
       4A0FBE1C2B           movsx    rbx, byte  ptr [rbx+r13]
       458BDB               mov      r11d, r11d
       4F0FBE1C2B           movsx    r11, byte  ptr [r11+r13]
       C1E70C               shl      edi, 12
       C1E306               shl      ebx, 6
       0BFB                 or       edi, ebx
       41C1E212             shl      r10d, 18
       450BD3               or       r10d, r11d
       440BD7               or       r10d, edi
       4585D2               test     r10d, r10d
       0F8CD9010000         jl       G_M25171_IG29
       418BFA               mov      edi, r10d
       C1FF10               sar      edi, 16
       41883F               mov      byte  ptr [r15], dil
       418BFA               mov      edi, r10d
       C1FF08               sar      edi, 8
       41887F01             mov      byte  ptr [r15+1], dil
       45885702             mov      byte  ptr [r15+2], r10b
       4983C604             add      r14, 4
       4983C703             add      r15, 3
       4C3BF2               cmp      r14, rdx
       0F826EFFFFFF         jb       G_M25171_IG16

G_M25171_IG17:
       448B55A4             mov      r10d, dword ptr [rbp-5CH]
       418BFA               mov      edi, r10d
       2B7D9C               sub      edi, dword ptr [rbp-64H]
       3BF8                 cmp      edi, eax
       0F8538010000         jne      G_M25171_IG25
       4D3BF4               cmp      r14, r12
       750F                 jne      SHORT G_M25171_IG18
       807D1000             cmp      byte  ptr [rbp+10H], 0
       0F8467010000         je       G_M25171_IG27
       E98B010000           jmp      G_M25171_IG29

G_M25171_IG18:
       410FB64424FC         movzx    rax, byte  ptr [r12-4]
       410FB67C24FD         movzx    rdi, byte  ptr [r12-3]
       410FB65424FE         movzx    rdx, byte  ptr [r12-2]
       450FB65C24FF         movzx    r11, byte  ptr [r12-1]
       8BC0                 mov      eax, eax
       48BB67DC95C2D27F0000 mov      rbx, 0x7FD2C295DC67
       480FBE0418           movsx    rax, byte  ptr [rax+rbx]
       8BFF                 mov      edi, edi
       480FBE3C1F           movsx    rdi, byte  ptr [rdi+rbx]
       C1E012               shl      eax, 18
       C1E70C               shl      edi, 12
       0BC7                 or       eax, edi
       8B7DA0               mov      edi, dword ptr [rbp-60H]
       4803F9               add      rdi, rcx
       4183FB3D             cmp      r11d, 61
       7451                 je       SHORT G_M25171_IG19
       8BD2                 mov      edx, edx
       48BB67DC95C2D27F0000 mov      rbx, 0x7FD2C295DC67
       480FBE141A           movsx    rdx, byte  ptr [rdx+rbx]
       458BDB               mov      r11d, r11d
       4D0FBE1C1B           movsx    r11, byte  ptr [r11+rbx]
       C1E206               shl      edx, 6
       410BC3               or       eax, r11d
       0BC2                 or       eax, edx
       85C0                 test     eax, eax
       0F8C1E010000         jl       G_M25171_IG29
       498D5703             lea      rdx, [r15+3]
       483BD7               cmp      rdx, rdi
       0F87AA000000         ja       G_M25171_IG25
       8BF8                 mov      edi, eax
       C1FF10               sar      edi, 16
       41883F               mov      byte  ptr [r15], dil
       8BF8                 mov      edi, eax
       C1FF08               sar      edi, 8
       41887F01             mov      byte  ptr [r15+1], dil
       41884702             mov      byte  ptr [r15+2], al
       4983C703             add      r15, 3
       EB5B                 jmp      SHORT G_M25171_IG21

G_M25171_IG19:
       83FA3D               cmp      edx, 61
       743C                 je       SHORT G_M25171_IG20
       8BD2                 mov      edx, edx
       49BB67DC95C2D27F0000 mov      r11, 0x7FD2C295DC67
       4A0FBE141A           movsx    rdx, byte  ptr [rdx+r11]
       C1E206               shl      edx, 6
       0BC2                 or       eax, edx
       85C0                 test     eax, eax
       0F8CD3000000         jl       G_M25171_IG29
       498D5702             lea      rdx, [r15+2]
       483BD7               cmp      rdx, rdi
       7763                 ja       SHORT G_M25171_IG25
       8BF8                 mov      edi, eax
       C1FF10               sar      edi, 16
       41883F               mov      byte  ptr [r15], dil
       C1F808               sar      eax, 8
       41884701             mov      byte  ptr [r15+1], al
       4983C702             add      r15, 2
       EB1A                 jmp      SHORT G_M25171_IG21

G_M25171_IG20:
       85C0                 test     eax, eax
       0F8CAD000000         jl       G_M25171_IG29
       498D5701             lea      rdx, [r15+1]
       483BD7               cmp      rdx, rdi
       773D                 ja       SHORT G_M25171_IG25
       C1F810               sar      eax, 16
       418807               mov      byte  ptr [r15], al
       49FFC7               inc      r15

G_M25171_IG21:
       4983C604             add      r14, 4
       443B55D0             cmp      r10d, dword ptr [rbp-30H]
       0F858D000000         jne      G_M25171_IG29

G_M25171_IG22:
       498BC6               mov      rax, r14
       482BC6               sub      rax, rsi
       418900               mov      dword ptr [r8], eax
       498BC7               mov      rax, r15
       482BC1               sub      rax, rcx
       418901               mov      dword ptr [r9], eax

G_M25171_IG23:
       33C0                 xor      eax, eax

G_M25171_IG24:
       C5F877               vzeroupper
       488D65D8             lea      rsp, [rbp-28H]
       5B                   pop      rbx
       415C                 pop      r12
       415D                 pop      r13
       415E                 pop      r14
       415F                 pop      r15
       5D                   pop      rbp
       C3                   ret

G_M25171_IG25:
       443B55D0             cmp      r10d, dword ptr [rbp-30H]
       0F95C0               setne    al
       0FB6C0               movzx    rax, al
       8B7D10               mov      edi, dword ptr [rbp+10H]
       400FB6FF             movzx    rdi, dil
       85C7                 test     eax, edi
       7552                 jne      SHORT G_M25171_IG29
       498BC6               mov      rax, r14
       482BC6               sub      rax, rsi
       418900               mov      dword ptr [r8], eax
       498BC7               mov      rax, r15
       482BC1               sub      rax, rcx
       418901               mov      dword ptr [r9], eax
       B801000000           mov      eax, 1

G_M25171_IG26:
       C5F877               vzeroupper
       488D65D8             lea      rsp, [rbp-28H]
       5B                   pop      rbx
       415C                 pop      r12
       415D                 pop      r13
       415E                 pop      r14
       415F                 pop      r15
       5D                   pop      rbp
       C3                   ret

G_M25171_IG27:
       498BC6               mov      rax, r14
       482BC6               sub      rax, rsi
       418900               mov      dword ptr [r8], eax
       498BC7               mov      rax, r15
       482BC1               sub      rax, rcx
       418901               mov      dword ptr [r9], eax
       B802000000           mov      eax, 2

G_M25171_IG28:
       C5F877               vzeroupper
       488D65D8             lea      rsp, [rbp-28H]
       5B                   pop      rbx
       415C                 pop      r12
       415D                 pop      r13
       415E                 pop      r14
       415F                 pop      r15
       5D                   pop      rbp
       C3                   ret

G_M25171_IG29:
       498BC6               mov      rax, r14
       482BC6               sub      rax, rsi
       418900               mov      dword ptr [r8], eax
       498BC7               mov      rax, r15
       482BC1               sub      rax, rcx
       418901               mov      dword ptr [r9], eax
       B803000000           mov      eax, 3

G_M25171_IG30:
       C5F877               vzeroupper
       488D65D8             lea      rsp, [rbp-28H]
       5B                   pop      rbx
       415C                 pop      r12
       415D                 pop      r13
       415E                 pop      r14
       415F                 pop      r15
       5D                   pop      rbp
       C3                   ret

G_M25171_IG31:
       33FF                 xor      edi, edi
       E84BE7FFFF           call     ThrowHelper:ThrowArgumentOutOfRangeException(int)
       CC                   int3

; Total bytes of code 1374, prolog size 51 for method Base64:DecodeFromUtf8(struct,struct,byref,byref,bool):int
; ============================================================
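
For anyone who doesn't want to trace the listing: the scalar loop at `G_M25171_IG16` looks up four input bytes in a sign-extended table, packs the results into one 24-bit value, and treats a set sign bit as invalid input. The vector loops appear to do the same packing with `vpmaddubsw` / `vpmaddwd` and the constants `0x1400140` and `0x11000`. Below is a minimal C sketch of both steps as read from the disassembly. It is not the corefx source, and `decoding_map`, `decode_quad` and `pack_lane_madd` are hypothetical names used only for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical decoding map: 6-bit value for valid characters, -1 otherwise. */
static int8_t decoding_map[256];
static int decoding_map_ready = 0;

static void init_decoding_map(void)
{
    static const char alphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    memset(decoding_map, -1, sizeof decoding_map);   /* every byte becomes 0xFF == -1 */
    for (int i = 0; i < 64; i++)
        decoding_map[(unsigned char)alphabet[i]] = (int8_t)i;
    decoding_map_ready = 1;
}

/* Decodes 4 base64 characters into 3 bytes.
 * Returns 0 on success, -1 if any character is not in the alphabet. */
static int decode_quad(const uint8_t *src, uint8_t *dst)
{
    if (!decoding_map_ready)
        init_decoding_map();

    /* Sign-extending loads (the movsx instructions in the listing):
     * an invalid character yields -1, i.e. all bits set. */
    int32_t i0 = decoding_map[src[0]];
    int32_t i1 = decoding_map[src[1]];
    int32_t i2 = decoding_map[src[2]];
    int32_t i3 = decoding_map[src[3]];

    /* Shift as unsigned to avoid undefined behaviour on negative values. */
    uint32_t v = ((uint32_t)i0 << 18) | ((uint32_t)i1 << 12)
               | ((uint32_t)i2 << 6)  |  (uint32_t)i3;

    /* Valid input fits in 24 bits, so bit 31 is set only if a lookup was -1. */
    if (v & 0x80000000u)
        return -1;

    dst[0] = (uint8_t)(v >> 16);
    dst[1] = (uint8_t)(v >> 8);
    dst[2] = (uint8_t)v;
    return 0;
}

/* Scalar emulation of one 32-bit lane of the vector packing step.
 * s[0..3] are already-translated 6-bit values (0..63). */
static uint32_t pack_lane_madd(const uint8_t s[4])
{
    uint32_t w0 = s[0] * 0x40u + s[1] * 0x01u;  /* vpmaddubsw with bytes 0x40, 0x01 */
    uint32_t w1 = s[2] * 0x40u + s[3] * 0x01u;
    return w0 * 0x1000u + w1 * 0x0001u;         /* vpmaddwd with words 0x1000, 0x0001 */
}
```

With this sketch, `decode_quad` turns `"TWFu"` into `"Man"`, and `pack_lane_madd` produces the same 24-bit value from the sextets 19, 22, 5, 46. One validity check covers all four characters, so the hot loop needs a single branch per quad.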

Sorry for the noise from force-pushing. I had copy & paste errors from the benchmark project...

gfoidl added 2 commits May 27, 2019 17:23
Improved the version done in c8b6cb3, so the static data isn't needed and code is more compact and readable.

@tannergooding tannergooding left a comment

LGTM. I think this can be merged once tests complete

@tannergooding tannergooding merged commit 036e0a6 into dotnet:master May 29, 2019
@tannergooding

Thanks @gfoidl! 🎉

@gfoidl gfoidl deleted the base64-simd branch May 29, 2019 15:35
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
* Optimized scalar code-path

* Fixed label names

* Implemented vectorized versions

* Added reference to source of algorithm

* Added back missing namespace

* Unsafe.Add instead of Unsafe.Subtract

Fixed build-failure (https://ci3.dot.net/job/dotnet_corefx/job/master/job/linux-musl-TGroup_netcoreapp+CGroup_Debug+AGroup_x64+TestOuter_false_prtest/8247/console)
Seems like the internal Unsafe doesn't have a Subtract method, so use Add instead.

* Added THIRD-PARTY-NOTICES

* PR Feedback

* THIRD-PARTY-NOTICES in repo-base instead of in folder

Cf. dotnet/corefx#34529 (comment)

* PR Feedback

* dotnet/corefx#34529 (comment)
* dotnet/corefx#34529 (comment)

* Rewritten to use raw-pointers instead of GC-tracked refs

Cf. dotnet/corefx#34529 (comment)

* Initialized the static fields directly (i.e. w/o cctor)

Cf. dotnet/corefx#34529 (comment)

* Added a test for decoding a (encoded) Guid

The case of decoding 16 encoded bytes was not covered by tests, so incorrect code was committed earlier, resulting in DestinationTooSmall instead of the correct Done.

* EncodingMap / DecodingMap as byref instead of pointer

This gets rid of the `rep stosd` in the prolog. Cf. dotnet/corefx#34529 (comment)

* PR Feedback

* dotnet/corefx#34529 (comment)

* Debug.Fail instead of throwing for the assertion

Cf. dotnet/corefx#34529 (comment)

* ROSpan for static data

* ROS for lookup maps

* In decode avoided stack spill and hoisted zero-vector outside the loops

Cf. dotnet/corefx#34529 (comment)

* Assert assumption about destLength

Cf. dotnet/corefx#34529 (comment)

* Added comments from original source and some changes to variable names

Cf. dotnet/corefx#34529 (comment) and dotnet/corefx#34529 (comment)

* Use TestZ instead of MoveMask in AVX2-path

Cf. dotnet/corefx#34529 (comment)

* Fixed too complicated mask2F creation

Improved the version done in dotnet/corefx@c8b6cb3, so the static data isn't needed and code is more compact and readable.


Commit migrated from dotnet/corefx@036e0a6
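
A note on the Guid test in the commit list above: 16 bytes encode to 24 characters, and 24 is also the length the listing compares against before entering the vectorized paths (`cmp eax, 24`), so this input sits on a boundary. The following C sketch shows the length arithmetic as it can be read from `G_M25171_IG04` in the disassembly. The function names are hypothetical, lengths are assumed to be non-negative, and this is not taken from the corefx source.

```c
#include <stdint.h>

/* Length of the padded base64 text for n input bytes. */
static int32_t encoded_length(int32_t n)
{
    return ((n + 2) / 3) * 4;
}

/* Upper bound for decoded bytes: 3 per complete 4-character group. */
static int32_t max_decoded_length(int32_t src_length)
{
    return (src_length >> 2) * 3;
}

/* How many source bytes can be consumed for a given destination size.
 * The "- 2" allows for up to two padding characters in the last group. */
static int32_t usable_source_length(int32_t src_length, int32_t dest_length)
{
    int32_t whole_groups = src_length & ~3;          /* and r10d, -4 */
    if (dest_length >= max_decoded_length(whole_groups) - 2)
        return whole_groups;
    return (dest_length / 3) * 4;                    /* the imul by 0x55555556 */
}
```

For an encoded Guid, `usable_source_length(24, 16)` is 24, so the whole input is eligible for decoding into a 16-byte destination. A 15-byte destination limits the usable input to 20 bytes.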
Successfully merging this pull request may close these issues.

base64 encoding with simd-support